ABSTRACT


Global Combat Support System - Marine Corps is a large logistics system designed to
replace numerous legacy systems used by the Marine Corps. Although it has been in existence
for some time, its intended potential has not been fully realized. Various teams are
therefore working to develop analytics that will benefit the community. As the data grows,
the only way these analytics (written in Structured Query Language [SQL]) will run
efficiently is on proprietary hardware from Oracle. This research looks at running the
same analytics on commodity hardware using the Hadoop Distributed File System and Java
MapReduce. The results show that while it takes longer to program in Java than in SQL,
the analytics are just as powerful as, or even more powerful than, SQL, and the potential
to save on hardware cost is significant.
CHAPTER 1:

Introduction 


Big data has become a buzzword in both the government and private sectors. That does
not mean that big data is not a worthwhile venture. There is no hiding from the fact that
data continues to grow, and processing enormous amounts of data using conventional
techniques can become unmanageable. For example, the Pentagon is attempting to expand its
worldwide communications network to handle yottabytes (YB) of data, where one YB is
10^24 bytes, or one trillion terabytes (TB) [1]. As the data grows past new thresholds,
current database architectures struggle to keep up. This thesis examines the Global Combat
Support System - Marine Corps (GCSS-MC) database and how to apply a big data solution to it.


1.1 The Problem 

The amount of data being stored in databases is increasing in both the government and
private sectors. The United States Marine Corps (USMC) is no different: it needs to access,
store, and process large amounts of data. The USMC is currently working on an
enterprise resource planning (ERP) system that will replace all of its legacy logistics
systems. The solution that has been developed is called GCSS-MC. GCSS-MC is a complex
undertaking and has seen its share of problems during development. The system is making
significant progress and is becoming a useful resource to the USMC. However, the
system has taken a significant amount of time to implement while the data
continues to grow.

Initially, GCSS-MC was not designed with a big data element. But as the amount of data
and the desire to retain it increase, the USMC needs to find ways to add a big data
solution to GCSS-MC. This thesis explores a method for adding a big data element to
GCSS-MC by adding a Hadoop Distributed File System (HDFS) cluster. Furthermore, it
discusses a method showing that no hardware changes need to be made to the existing
GCSS-MC architecture.

1.2 Research Questions

The research and work of this thesis will be focused on addressing the following questions: 

• What would an architecture look like that adds a big data element to GCSS-MC? 

• If an architecture can be developed, what modifications would the GCSS-MC
architecture need?

• How can the data contained in GCSS-MC be imported into HDFS? 

• What type of analytics can Hadoop provide for GCSS-MC data? 

• How will the data get back to the GCSS-MC database? 

1.3 Contributions 

The GCSS-MC database contains structured data stored in a relational database. This
data is accessed, parsed, and stored in the HDFS ecosystem, where it is used to run
analytics for the GCSS-MC system. The key idea is that HDFS is used to examine all of
the data in the corpus, perform some calculations, and return the results in a different
format to the GCSS-MC database. This becomes particularly useful when only a small amount
of data needs to be derived and displayed from the entire data corpus. The approach is
illustrated in two separate analytic programs. The programs are written to show that any
analytic that can be run as a typical database operation can be accomplished in HDFS.
Furthermore, HDFS can be used to perform the same queries that a Structured Query
Language (SQL) statement can perform. This is of growing interest as the data exceeds the
capability of standard database designs.

1.4 Thesis Outline 

The rest of the chapters in this thesis will further expand on the aforementioned ideas and 
research questions. Below is a brief outline of what the reader can expect to find in each 
chapter. 

Chapter 2 serves as a starting point for the technology covered in this thesis. The chapter
reviews the current architecture and status of the GCSS-MC system. The chapter also
discusses the origin of the MapReduce paradigm through an introduction to Jeffrey Dean
and Sanjay Ghemawat’s paper on MapReduce. Furthermore, an explanation and walk-through
of the canonical WordCount example is given. The chapter wraps up with an
introduction to the HDFS architecture.

The experiment architecture is introduced in Chapter 3. The chapter begins with some 
statistics to further illustrate the need for big data solutions. The chapter then outlines how 
a big data element could be added to the current GCSS-MC architecture. Finally, it explains 
how the Hadoop cluster is created, and some implementation decisions that were required. 

Chapter 4 details the specifics of the experiment. First, there is a discussion of the
GCSS-MC data and how it was accessed. Then the chapter describes how the data is
parsed and stored in the Hadoop ecosystem. Two separate analytics are also discussed
in Chapter 4. The first is the Top 100 National Stock Number (NSN) program, which is
shown in detail. The second is a program called Alpha, an example of SQL simulation,
which is also discussed and shown in detail.

The final chapter summarizes the findings and revisits the research questions
presented in the thesis. Chapter 5 also discusses possible future work, such as
adding big data tools (Hive, Sqoop, HBase, etc.), running the experiment code on a
production-level HDFS ecosystem, and further optimizing the code.

CHAPTER 2:
The Current State 


The Department of Defense (DOD) creates, monitors, and views petabytes (PB) of data
every day. The current data management software and hardware make it very difficult
to manage, organize, and parse large datasets. Moreover, the data resides across several
databases. This infrastructure makes it difficult, or even impossible, for a user who needs
to aggregate data from several databases to make intelligent deductions from that data. The
DOD has significant work remaining to consolidate all of its data into one space. Several
projects are working on a solution that places all of the data in a cloud environment
where all users can access it. These programs vary greatly across the DOD, and no
front-runner sets the standard. Before the DOD can start using the cloud as a large data
store, several steps need to take place to prepare it for the move to a cloud environment.

The DOD took on the challenge of bringing all of its legacy business systems into a
cohesive system in the 1990s. DOD’s business systems are information systems, including
financial and non-financial systems, that support DOD business operations such as
civilian personnel, finance, health, logistics, military personnel, procurement, and
transportation [2]. This effort has become known as ERP and entails an automated system
using commercial off-the-shelf (COTS) software consisting of multiple, integrated
functional modules that perform a variety of business-related tasks such as general ledger
accounting, payroll, and supply chain management [2]. ERP efforts across the DOD have
been widely publicized for being behind schedule and over budget, owing to their size,
complexity, and significance to the DOD. ERP has been placed on the Government
Accountability Office (GAO) high-risk list [2].

In total, nine ERP solutions are being developed for the DOD. Together, these systems
are intended to replace over 500 legacy systems. Consolidating the legacy systems into
systems that aggregate their data will save the DOD money and give end users a more
powerful system that they can use to manage their work and to run analytics that may help
them improve their work methods. The ERP effort is ongoing, and significant work has
recently been put toward making it a success. The USMC solution for ERP is
GCSS-MC. GCSS-MC will replace legacy systems and provide a tailored solution that is
useful to both the Marine in the field and the Marine shipping the supplies.


2.1 GCSS-MC: The USMC ERP Solution 

The GCSS-MC system was designed to run on a COTS architecture and combine all of
the USMC legacy business systems into one system. The effort has been ongoing since
November 2003 and was intended to deliver integrated functionality across the logistics
areas [3]. It has progressed in slow increments through the delivery of COTS architecture
to bring all of the USMC logistics tools to one place. Figure 2.1 is an example of
the system architecture, from [4]. The program has had its setbacks and successes.
Replacing up to forty-eight legacy systems is not easily accomplished [3]. The system
attempts to replace all legacy systems in one place, which makes GCSS-MC incredibly
complex. It has to meet the needs of the USMC both in garrison and in the field, all while
maintaining real-time data updates over the entire system. This is a difficult
undertaking and should not be taken lightly. The point of this thesis is not to debate
what should or should not have been done during the design or implementation of
GCSS-MC. Moreover, this thesis does not aim to replace or change the current
architecture of the GCSS-MC system. Its main aim is to show a proof of concept that
big data analytics can be added to the current GCSS-MC architecture.
GCSS-MC uses proven Oracle technology to give users a reliable relational database.
The work on GCSS-MC is not done, and its implementation continues, but it does not
yet incorporate a big data element. A big data element would give the USMC the ability to
store and analyze more data. Currently, the only data that GCSS-MC addresses is structured
data.

Figure 2.1: GCSS-MC Architecture


2.2 Big Data 

Big data has become a buzzword in both industry and the government sector; evidence
of this is President Obama’s launch of a Big Data Research and Development
Initiative in March 2012 [5]. It has helped shift how data is thought about and what can
be done with data. Big data does not answer all questions, and there are limitations to
what it can do. Big data can take the form of MapReduce, NoSQL, Bigtable, or something
completely different. The big data industry changes rapidly, and new tools and
architectures are constantly being introduced. These tools all provide some benefits and
some drawbacks. No one technology provides everything an end user would want. There is
currently no de facto standard, and arguments exist as to which technology is best. The
simple answer is that they all provide something, and the right choice depends on the type
of data, how much data there is, and what the desired analytics are.

2.3 MapReduce

One of the major core technologies for big data is MapReduce. This process is used
heavily at Google and in Apache’s HDFS system [6], [7]. Google’s development in the
MapReduce space is largely what pushed the open source community to answer with HDFS [8].
Google’s efforts were documented in Jeffrey Dean and Sanjay Ghemawat’s paper on the
MapReduce paradigm [6]. This paper gave the data science community as a whole
a look at how Google operated its computer cluster to perform a large number of tasks.
The Google MapReduce paper was the first publication illustrating a clustered MapReduce
framework. Since the paper was published, several other models have been developed
based on Google’s model. Apache Hadoop is an example of a cluster computing environment
created after this paper was published. The paper is accepted as seminal and has been
cited in thousands of other publications.

MapReduce is a programming model that was developed by Google as an answer to how
to process and generate large amounts of data. At Google, MapReduce jobs are run every
day, and they process more than twenty PB of data per day [6]. The programming model
was developed to abstract the complexity of cluster computing away from the programmer.
Parallel computing is extremely difficult, and if the programmer has to handle how to
parallelize the computation, how the system distributes the data around the cluster, and
failures of cluster nodes, as well as the data itself, the task can be overwhelming. Google
developed a system and a programming model that abstract these concerns away from the
user. The programmer can then write code that is executed on the cluster through the
Google File System [9]. The programmer only has to write the code that handles the
calculation; the distribution of the data and how the work is parallelized are handled by
the underlying system. This allows the programmer to focus on the algorithms rather than
on the parallel features.

Google found that most of its data computations involved mapping the data to a
computation and then storing the resultant key/value pairs. Google took further inspiration
from the map and reduce primitives available in Lisp [6]. This led
to the development of the MapReduce programming model. The programmer specifies
both the map and reduce functions to handle the data as key/value pairs. The map function
first takes the data and maps each logical record to intermediate key/value pairs. The
intermediate key/value pairs are then sent to the reduce function, which applies the same
function to all of the values that share the same key. The canonical example of this is
the word count problem, which is discussed further in Section 2.4.
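
The map and reduce steps described above can be illustrated without a cluster. The
following is a minimal single-process sketch in plain Java, not Hadoop or Google code.
The class name MiniMapReduce, the "itemId,quantity" record format, and the sample
records are hypothetical and chosen only for illustration. The map step turns each input
record into an intermediate key/value pair, the pairs are grouped by key, and the reduce
step applies one function to the values of each key.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {

    // Map step: one input record ("itemId,quantity") becomes one
    // intermediate key/value pair (itemId, quantity).
    public static Map.Entry<String, Integer> map(String record) {
        String[] fields = record.split(",");
        return new AbstractMap.SimpleEntry<>(fields[0].trim(), Integer.parseInt(fields[1].trim()));
    }

    // Reduce step: all values that share a key are combined into one result.
    public static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: map every record, group the intermediate pairs by key,
    // then reduce each group. A real framework spreads this work over many nodes.
    public static Map<String, Integer> run(List<String> records) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String record : records) {
            Map.Entry<String, Integer> pair = map(record);
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample records, not GCSS-MC data.
        List<String> records = Arrays.asList("A100,2", "B200,5", "A100,3");
        System.out.println(run(records)); // prints {A100=5, B200=5}
    }
}
```

The programmer supplies only the map and reduce methods; everything in run stands in
for what the framework would do on the programmer's behalf.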

Google focused on implementing this cluster on a large number of commodity personal
computers (PCs) connected by a Gigabit Ethernet network [9]. This allowed Google to quickly
and cheaply build a large-scale cluster. The idea of using commodity PCs is important
because, before this framework was developed, the conventional thinking was to build a
supercomputing environment, which needed very powerful, expensive servers. By showing
that the computational power, storage, and speed could be matched with a collection of
PCs, Google allowed the parallel computing community to expand into other areas
and made supercomputing achievable for more people [10]. Google’s implementation
also has thousands of machines, and with that many machines, failures are common.
Google’s framework therefore supports fault tolerance and allows a machine to
be replaced, and since the machines are PCs, the cost to replace them is significantly lower
than that of a server.

The Google framework executes the MapReduce program [6] by first splitting the input
files into pieces, typically 16 MB to 64 MB each. These pieces are then sent
to worker nodes, and a map function is started on each worker node that has been
assigned a data split. One of the nodes is designated as the master. The master is
responsible for assigning worker nodes to the map and reduce functions, and it
attempts to assign the tasks to idle nodes in the cluster. Each worker that is
assigned an input split performs the map function on the input data and parses
it into intermediate key/value pairs. These pairs are buffered and written to the local
disk. The location of the intermediate pairs is then sent to the master so that the
master can tell the reduce workers where their input data resides.
A reduce worker then makes a call to fetch the data from the location passed to it
by the master. The reduce worker reads all of the data for its partition and sorts
the intermediate data by key. It then iterates over the data and, for each unique key,
passes the key and its values to the reduce function specified by the program. The
reduce worker appends its output to the output file specified by the program. When all of
the map and reduce tasks have reported completion to the master, the master wakes up the
program, and the MapReduce call returns to that program. A pictorial example of this
process is shown in Figure 2.2, from [6].



[Figure: Input files -> Map phase -> Intermediate files (on local disks) -> Reduce phase -> Output files]

Figure 2.2: Google Execution Overview 
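
The first step of the execution, splitting the input into fixed-size pieces, can be
sketched as simple arithmetic. The snippet below is an illustrative assumption rather
than the framework's actual splitting code: the class SplitPlanner, the method
planSplits, and the 150 MB and 64 MB figures are hypothetical names and values, and a
real implementation would typically also take record boundaries and data locality into
account.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitPlanner {

    // Returns {offset, length} pairs that cover a file of fileSize bytes
    // with pieces of at most splitSize bytes each.
    public static List<long[]> planSplits(long fileSize, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += splitSize) {
            long length = Math.min(splitSize, fileSize - offset);
            splits.add(new long[] {offset, length});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A hypothetical 150 MB input with 64 MB splits yields three map tasks.
        for (long[] split : planSplits(150 * mb, 64 * mb)) {
            System.out.println("offset=" + split[0] + " length=" + split[1]);
        }
    }
}
```

Each returned pair corresponds to one map task, so the number of splits determines how
much of the map phase can run in parallel.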


2.4 Word Count 

The MapReduce programming paradigm can be used to process all kinds of data and can
become very convoluted. The widely accepted "Hello World" of MapReduce is the word
count program. This example has been used by Google [6] and is also used as an example
by Hadoop [11]. Figure 2.3 is the source code for the WordCount program from the
Hadoop web page [11]. This is the example code that we will step through to explain the
MapReduce process.


    package org.myorg;

    import java.io.IOException;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.*;

    public class WordCount {

        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                String line = value.toString();
                StringTokenizer tokenizer = new StringTokenizer(line);
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    output.collect(word, one);
                }
            }
        }

        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            conf.setMapperClass(Map.class);
            conf.setCombinerClass(Reduce.class);
            conf.setReducerClass(Reduce.class);

            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }

Figure 2.3: Hadoop WordCount Program

The first section of code (Figure 2.4) gives the program access to the libraries
needed to execute the code in the Hadoop environment. The first two imports are Java
packages that are not specific to the Hadoop libraries. The first import is the
IOException class from the Java I/O package. This allows the program to throw the
exceptions it encounters, which provides a graceful shutdown and some help in debugging.
The next import is the Java utilities package, which gives the program access to the
Iterator and StringTokenizer classes. The next five imports are all Hadoop specific. The
org.apache.hadoop.fs.Path class allows the program to refer to locations in the Hadoop
file system. This is needed both for the input path of the files needed to run the
program and for the output path of the result. The next package,
org.apache.hadoop.conf.*, is needed to set the job up. This is the part of the code
that is executed in the main method. The package org.apache.hadoop.io.* is needed to
read data into the program and write data out of the program, and is additionally needed
to write to logs, stdout, and stderr. The Hadoop utility package,
org.apache.hadoop.util.*, is used to report progress on the program during execution.
Finally, the org.apache.hadoop.mapred.* package is where the program gets access to the
majority of the classes it needs. This package includes the interface definitions for the
map and reduce methods. It also gives the program the class definitions for the input
format, the input split, the job configuration, the output collector, and the output
format.


    import java.io.IOException;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.*;

Figure 2.4: Hadoop WordCount Program: Libraries


The map class is the next area of focus (Figure 2.5). The program first has to
make the map class declaration.

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable>

In this case, Map is the program-defined class name, and it extends the MapReduceBase
class, inheriting all of its properties. Extending MapReduceBase is what allows the Map
class to be called in the job. Next, the Map class must implement the Mapper interface.
This is the class that actually maps the input to the intermediate output. The program
must specify the input key/value data types and the output key/value data types that will
become the intermediate data. Here the input key is the byte offset of the line within the
input document, and the value is the entire line of input. The Map class must then declare
a map method.

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)


The method again specifies the input key/value types. It declares the OutputCollector
parameter, sets the key/value output types, and gives the OutputCollector
a name. The Reporter parameter is what reports status back to the Hadoop system. This
map method sets the output types to Text and IntWritable, both of which are Hadoop types.
Finally, the output statement writes the intermediate data out.

    output.collect(word, one);


    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

Figure 2.5: Hadoop WordCount Program: Map Class


Next, the reduce class takes over (Figure 2.6). The reduce class has to be specified in
the program and must accept as input the data types that the map class outputs. The
reduce class declaration is the first thing that must be considered.

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable>


As with the map class, in order to run as a job the reduce class must also extend the
MapReduceBase class. Then, the reduce class implements the Reducer interface and must
declare the key/value input and output data types. This example shows the case where the
reduce class key/value input and output types are the same; this is not a constraint. The
reduce class could output whatever the programmer chooses as long as it is a Hadoop data
type. The reduce class must define a reduce method.

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)


The reduce method declaration is similar to the map declaration. The OutputCollector and
Reporter declarations are the same. The difference here is the declaration of the input
key/value pairs. The method must define the key type and then an iterator over the input
value type. This is because before the intermediate data gets to the reduce class, it is
shuffled, so the reduce class receives all of the intermediate data that has the same
key. The reduce method is therefore given a key and then iterates over all the values in
the intermediate data that have that key. This is where the output key/value
pairs are reduced in the manner specified by the program. In this example, the final value
for each key is just the sum of all of the values. Finally, the reduce class has to set the
output. This is similar to the map output statement.

    output.collect(key, new IntWritable(sum));


    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

Figure 2.6: Hadoop WordCount Program: Reduce Class

The final piece the program must implement is a main method (Figure 2.7). The main
method is responsible for the setup of the MapReduce job. This is done by declaring a
JobConf object and telling it where the map and reduce methods are found. In this example,
the classes are found in the WordCount class.

    JobConf conf = new JobConf(WordCount.class);

The JobConf typically sets the mapper, reducer, input format, and output format [12], but
it also allows the programmer to tune the job. Here the programmer can set specific
combiner classes, use the distributed cache, use custom comparators, and
even change how the program executes by adjusting the memory requirements for the map
and reduce tasks. The job setup also includes the input and output paths. In this
example, they are taken from the command line arguments.

    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
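
The job setup registers Reduce as the combiner as well as the reducer. That reuse
works for word count because addition is associative and commutative, so summing partial
counts on each map node and then summing those partial sums gives the same total as
summing every value at once. The sketch below checks this in plain Java. The class
CombinerDemo and its sample counts are hypothetical, and it does not call any Hadoop API.

```java
import java.util.Arrays;
import java.util.List;

public class CombinerDemo {

    // The same summing logic serves as both combiner and reducer.
    public static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical counts for one key emitted by two different map tasks.
        List<Integer> fromMapTask1 = Arrays.asList(1, 1, 1);
        List<Integer> fromMapTask2 = Arrays.asList(1, 1);

        // Without a combiner, the reducer receives all five values.
        int direct = sum(Arrays.asList(1, 1, 1, 1, 1));

        // With a combiner, each map task pre-sums locally and sends one value.
        int combined = sum(Arrays.asList(sum(fromMapTask1), sum(fromMapTask2)));

        System.out.println(direct + " " + combined); // prints 5 5
    }
}
```

A reduce function that is not associative and commutative, such as one computing an
average directly, could not be reused as a combiner in this way.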



The final piece to wrap up here is a walk-through of the WordCount program execution.
The idea is to use a small data set and step through the program execution to follow the
data flow and data transformation. This example uses the nursery rhyme Jack and
Jill (Figure 2.8).

    Jack and Jill went up the hill
    To fetch a pail of water.
    Jack fell down and broke his crown,
    And Jill came tumbling after

Figure 2.8: Jack and Jill Nursery Rhyme 


For the purpose of this walk-through, we will only consider one map task and one reduce
task. First, the first line of the file is read in by the Map
class. The first line comes in with a byte offset of 0, so the key is zero
and the value is the line "Jack and Jill went up the hill". Figure 2.9 shows the input
key/value pairs for the whole text.

     0  Jack and Jill went up the hill
    31  To fetch a pail of water.
    57  Jack fell down and broke his crown,
    93  And Jill came tumbling after

Figure 2.9: Byte Offset for Jack and Jill Nursery Rhyme 
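
The offsets in Figure 2.9 follow from the line lengths: each key is the number of bytes
that precede the line, counting one newline byte per line. A small sketch reproduces
them, assuming single-byte characters and "\n" line endings. The class LineOffsets is a
hypothetical name, and this is not Hadoop's input format code.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LineOffsets {

    // Computes the byte offset at which each line starts,
    // assuming every line is terminated by a single '\n' byte.
    public static List<Long> offsets(String[] lines) {
        List<Long> result = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            result.add(offset);
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
        }
        return result;
    }

    public static void main(String[] args) {
        String[] rhyme = {
            "Jack and Jill went up the hill",
            "To fetch a pail of water.",
            "Jack fell down and broke his crown,",
            "And Jill came tumbling after"
        };
        System.out.println(offsets(rhyme)); // prints [0, 31, 57, 93]
    }
}
```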


The line is then parsed using a StringTokenizer, which sets each word as a token. After the 
tokens are set, a while loop is set up on the condition that there are more tokens. This while 
loop will then set a Text variable to the String value of each token and then set the output 
to the word and one. This sends an intermediate data value of ("word", 1) for each word in 
the line. The intermediate value of the first line will be as follows: 


    Jack 1
    and 1
    Jill 1
    went 1
    up 1
    the 1
    hill 1
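
The tokenizing loop can be tried outside Hadoop. The sketch below mirrors the loop of
the map method in plain Java, replacing the OutputCollector with a list of strings so
that no Hadoop classes are needed. The class name MapLineDemo and the tab-separated
output format are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class MapLineDemo {

    // Emits one "word<TAB>1" pair per whitespace-separated token,
    // mirroring the while loop in the map method.
    public static List<String> mapLine(String line) {
        List<String> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(tokenizer.nextToken() + "\t1");
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String pair : mapLine("Jack and Jill went up the hill")) {
            System.out.println(pair);
        }
    }
}
```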


This process is repeated for each line in the input file, producing intermediate outputs
of ("word", 1) for each word that occurs in the file. This produces the intermediate data
that will then be sorted and shuffled before going to the reduce method. Figure 2.10
illustrates the intermediate values for the entire file.


    Jack 1
    and 1
    Jill 1
    went 1
    up 1
    the 1
    hill 1
    To 1
    fetch 1
    a 1
    pail 1
    of 1
    water 1
    Jack 1
    fell 1
    down 1
    and 1
    broke 1
    his 1
    crown 1
    And 1
    Jill 1
    came 1
    tumbling 1
    after 1


Figure 2.10: Intermediate Key/Value Pairs from the Map Method 


The sorting and shuffling process then takes over. The data is sorted by key; the
comparison is case sensitive, so capitalized words sort ahead of lowercase words. The
shuffling process groups the data into logical units and sends each unit to a reduce
task. Again, there is only one reduce task, so it receives the entire data set. The input
to the reduce method is illustrated in Figure 2.11.

    And 1
    Jack 1
    Jack 1
    Jill 1
    Jill 1
    To 1
    a 1
    after 1
    and 1
    and 1
    broke 1
    came 1
    crown 1
    down 1
    fell 1
    fetch 1
    hill 1
    his 1
    of 1
    pail 1
    the 1
    tumbling 1
    up 1
    water 1
    went 1


Figure 2.11: Shuffled Intermediate Key/Value Pairs from the Map Method 
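
The ordering in Figure 2.11, with capitalized words ahead of lowercase ones, is what a
byte-wise comparison of the keys produces, since uppercase ASCII letters have smaller
code points than lowercase letters. The sketch below imitates the sort-and-group step
with a TreeMap, whose natural String ordering gives the same result for this ASCII
input. ShuffleDemo is a hypothetical name, and the sketch is only an approximation:
Hadoop compares serialized Text keys rather than Java strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleDemo {

    // Groups the 1s emitted by the map step under their key,
    // with the keys kept in sorted order.
    public static Map<String, List<Integer>> shuffle(List<String> words) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : words) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Words in the order listed in Figure 2.10.
        List<String> words = Arrays.asList(
            "Jack", "and", "Jill", "went", "up", "the", "hill",
            "To", "fetch", "a", "pail", "of", "water",
            "Jack", "fell", "down", "and", "broke", "his", "crown",
            "And", "Jill", "came", "tumbling", "after");
        // Each entry is what one call to the reduce method would receive.
        shuffle(words).forEach((key, values) -> System.out.println(key + " " + values));
    }
}
```

Each printed line pairs a key with the list of values the reduce method would iterate
over for that key.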


Notice that this program treats "and" and "And" as different words. This is how the map
method is programmed; if this result is undesirable, the programmer could use string
manipulation or a regular expression to normalize the words. The reduce method then takes
over. Here, the reduce method takes each key in turn and iterates over all values for that
key. In this reduce method, an int variable named sum is used to total the number of
occurrences of each word. For instance, the word "Jack" is seen twice. The reducer sets the
value of sum to one on the first occurrence of "Jack" with the call:

sum += values.next().get();


At this point, sum is one. On the next occurrence of "Jack", the same statement is
executed, and sum becomes two. This process continues until there are no more values for
the key "Jack". In this case, there are no more values, so the reduce class then writes the
key and the sum to the output with the call:

output.collect(key, new IntWritable(sum));


The reduce method is called for each unique key in the intermediate data and writes
one output statement for each key until it has seen all of the keys in the intermediate data.
The final result in this example is shown in Figure 2.12.

And 1 
Jack 2 
Jill 2 
To 1 
a 1 

after 1 
and 2 
broke 1 
came 1 
crown 1 
down 1 
fell 1 
fetch 1 
hill 1 
his 1 
of 1 
pail 1 
the 1 

tumbling 1 
up 1 
water 1 
went 1 


Figure 2.12: Final Output of WordCount 
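
The map, shuffle, and reduce steps walked through above can be imitated in plain Java without any Hadoop classes. The following is only a minimal sketch under that assumption: the class and method names (WordCountSketch, map, shuffle, reduce, count) are illustrative and are not the Hadoop API. A TreeMap stands in for the sort performed during the shuffle, which places capitalized words before lowercase ones in the same way as Figure 2.11.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java imitation of the WordCount map, shuffle, and reduce steps (no Hadoop classes). */
final class WordCountSketch {

    /** Map step: emit one ("word", 1) pair for every whitespace-separated token in a line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    /** Shuffle step: group the values by key; a TreeMap keeps the keys in sorted order. */
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce step: sum all of the values seen for one key. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs the three steps over all lines and returns word -> count. */
    static TreeMap<String, Integer> count(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(map(line));
        }
        TreeMap<String, Integer> result = new TreeMap<>();
        shuffle(intermediate).forEach((word, ones) -> result.put(word, reduce(ones)));
        return result;
    }

    public static void main(String[] args) {
        List<String> rhyme = List.of(
                "Jack and Jill went up the hill",
                "To fetch a pail of water",
                "Jack fell down and broke his crown",
                "And Jill came tumbling after");
        count(rhyme).forEach((word, n) -> System.out.println(word + " " + n));
    }
}
```

If "and" and "And" should be counted together, one option is to call toLowerCase() on each word inside map before emitting the pair.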


2.5 Hadoop 

The Hadoop Distributed File System is an open source project that was modeled after the
Google architecture outlined in Jeffrey Dean and Sanjay Ghemawat's MapReduce paper
[6]. Hadoop started out as an open source web search engine called Nutch [7]. The Nutch
project was having some trouble handling the distributed nature of the computations that
were needed for a web search engine. Then, with the Google papers on MapReduce and the
Google File System (GFS), the project began to take hold [7], [6], [9]. At that time,
Yahoo! became interested in the project, and Hadoop split from the Nutch project and
became what it is today. HDFS is the distributed file system that supports the MapReduce
framework, and Figure 2.13 illustrates the HDFS architecture, from [11]. The distributed
file system is set up to be fault tolerant and to spread data over several nodes in a
cluster. The data replication factor is nominally set to three. However, this can be changed
and adapted to meet the specific needs of a given cluster. The most basic cluster, a single
node, has the following services running: NameNode, JobTracker, TaskTracker, Secondary
NameNode, and DataNode.



The NameNode is known as the master in the cluster. The NameNode is responsible for the
file system namespace [13] and manages the whole cluster. It maintains the file system tree
and stores the metadata for all of the directories and files within the distributed file
system. The NameNode is the most important node in the cluster because it is needed to
maintain all of the files in the distributed file system. If the NameNode were lost, then
all of the files in the distributed file system would be lost, because the NameNode
maintains the file structure needed to put the blocks back together. The NameNode stores
the metadata and file tree in two files: the namespace image and an edit log [7]. Since HDFS
stores all of the files in blocks, the NameNode keeps a reference to every block that exists
in the file system.

The Secondary NameNode is there to back up the NameNode. However, this is not a true
backup in the sense that the Secondary NameNode would take over if the NameNode failed.
The main purpose of the Secondary NameNode is to periodically merge the namespace
image with the edit log. This prevents the edit log from becoming too large [7]. The
Secondary NameNode keeps a copy of the merged edit log and file system image. It is
important to note that this copy is not real time; that is to say, if HDFS is restored from
the Secondary NameNode image, the changes made between the last edit log merge and the
NameNode failure will be lost. HDFS does provide a way for the NameNode to write its
persistent state to several file systems to reduce the chance of all the file systems
failing at once.
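
The relationship between the namespace image, the edit log, and a checkpoint can be illustrated with a toy model. This is a sketch only: the namespace is reduced to a set of paths, the Edit type and the merge method are hypothetical names, and none of this reflects the real on-disk format used by HDFS. It shows why an edit made after the last merge is missing from a restore that uses the checkpoint alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy model of a namespace image plus an edit log; not the real HDFS file formats. */
final class CheckpointSketch {

    /** One logged namespace change: create or delete a path (hypothetical record type). */
    static final class Edit {
        final boolean create;
        final String path;

        Edit(boolean create, String path) {
            this.create = create;
            this.path = path;
        }
    }

    /** Applies the logged edits, in order, to a copy of the image and returns the merged image. */
    static TreeSet<String> merge(TreeSet<String> image, List<Edit> editLog) {
        TreeSet<String> merged = new TreeSet<>(image);
        for (Edit edit : editLog) {
            if (edit.create) {
                merged.add(edit.path);
            } else {
                merged.remove(edit.path);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        TreeSet<String> image = new TreeSet<>(List.of("/data/a.json"));
        List<Edit> editLog = new ArrayList<>();
        editLog.add(new Edit(true, "/data/b.json"));
        editLog.add(new Edit(false, "/data/a.json"));

        // Checkpoint: fold the log into the image, then start a fresh, empty log.
        TreeSet<String> checkpoint = merge(image, editLog);
        editLog.clear();

        // An edit made after the checkpoint exists only in the log, so a restore
        // from the checkpoint alone would not include it.
        editLog.add(new Edit(true, "/data/c.json"));
        System.out.println("checkpoint: " + checkpoint);
        System.out.println("current:    " + merge(checkpoint, editLog));
    }
}
```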

The JobTracker is responsible for accepting jobs and then dividing the jobs into tasks and 
assigning those tasks to DataNodes [14]. The JobTracker will try to do the best it can 
to maintain data locality, meaning that it will try to give the tasks to the DataNodes that 
physically have the blocks of data the task is to be executed on. The JobTracker will access 
the NameNode to be able to determine which nodes physically have the data. This is done to 
cut down on the amount of data that has to be transferred over the network. The JobTracker 
will then contact TaskTracker Nodes to determine whether the node is available to run a 
task. The TaskTracker is the service running on the DataNodes that will communicate with 
the JobTracker for processing tasks it has been assigned. 
Figure 2.14: HDFS JobTracker Interaction
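
The data locality preference described above can be sketched as a simple selection rule. The code below is an illustration under stated assumptions, not the JobTracker's actual scheduling logic: the class LocalitySketch, the assign method, and the block and node names are all hypothetical. It prefers an idle node that already stores the block and falls back to any idle node otherwise, in which case the data would have to travel over the network.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Toy locality-aware task assignment; the real JobTracker logic is more involved. */
final class LocalitySketch {

    /**
     * Picks a node for a task over the given block: prefer an idle node that already
     * stores the block, otherwise fall back to any idle node (the data must then be copied).
     */
    static Optional<String> assign(String blockId,
                                   Map<String, Set<String>> blockLocations,
                                   List<String> idleNodes) {
        Set<String> holders = blockLocations.getOrDefault(blockId, Set.of());
        for (String node : idleNodes) {
            if (holders.contains(node)) {
                return Optional.of(node); // data-local assignment
            }
        }
        return idleNodes.stream().findFirst(); // remote assignment, or empty if no node is idle
    }

    public static void main(String[] args) {
        Map<String, Set<String>> blockLocations = Map.of(
                "block-1", Set.of("node2", "node5", "node7"));
        System.out.println(assign("block-1", blockLocations, List.of("node1", "node5")));
        System.out.println(assign("block-1", blockLocations, List.of("node1", "node3")));
    }
}
```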


The DataNode is the true worker in HDFS. The DataNodes store and retrieve blocks of
data. They inform the NameNode of the data blocks they have, and they inform the
JobTracker of the status of their currently running jobs and/or their availability. The
DataNodes are also the nodes responsible for actually running the MapReduce code on their
blocks of data. The DataNodes communicate directly with other DataNodes when they need to
send or share data. This direct access avoids the traffic that would be generated if the
nodes were required to go through the NameNode to communicate with another DataNode.
CHAPTER 3:

The Experiment Design 


The DOD is continuing to create and store all types of data [15]. The technology for 
how to deal with data is continuously changing. Most current DOD solutions deal with 
storing data in a relational database and use SQL or Procedural Language/Structured Query 
Language (PL/SQL) to provide the analytics for this data. SQL and PL/SQL provide a large 
range of analytics and are very useful for many applications, but when the size of data to 
analyze becomes large, this approach hits its limitations. Adding a Big Data element to a 
relational database will provide additional means to store and analyze data. 


3.1 Why Add a Big Data Element? 

The world is producing petabytes of data daily and the amount of data being stored is 
increasing. A few examples of this are [13]: 

• The New York Stock Exchange generates about one TB of new trade data per day. 

• Facebook hosts approximately 10 billion photos, taking up one PB of storage. 

• Ancestry.com, the genealogy site, stores around 2.5 PB of data. 

• The Internet Archive stores around 2 PB of data, and is growing at a rate of 20 TB 
per month. 

• The Large Hadron Collider near Geneva, Switzerland, will produce about 15 PB of
data per year.

The large scale of this data makes it difficult to store and process the data in a relational 
database. A big data element would allow some additional flexibility with storing and 
analyzing data. 

Specifically, the addition of an HDFS cluster would allow quick processing of the entire
data set. The true power of Hadoop is that it processes the entire data set [7]. Hadoop can
look at all of the data in the database and return whatever analysis the programmer
desires. Hadoop truly puts the power of all of the data in your corpus at your fingertips.
There is little to no concern that one must perform a large number of table scans to return
analytics; Hadoop by its very nature performs a table scan every time a program is run.
That is its true power, and it places no limitations on the possible analytics that can be
run. Hadoop uses the MapReduce paradigm and forces the programmer to deal with key/value
pairs, but these can be chained together to find analytics for all sorts of interesting
problems.


3.2 Adding a Big Data Element to GCSS-MC 

The current architecture of GCSS-MC does not support big data processing. The
architecture that we are proposing in this thesis shows how such support can be added as an
additional element and integrated with the current architecture as long as it has Internet
Protocol (IP) connectivity. The work we did in this thesis shows how a separate HDFS
cluster can interact with a separate database. We did not have access to the GCSS-MC
database, so we used an Oracle database to simulate it. Figure 3.1 shows the setup that we
used for our experimentation.



10 Node HDFS Cluster 
Virtualized on 2 Machines 

Figure 3.1: Experiment Architecture 
This experiment architecture does abstract away some of the difficulty that would be
experienced in fully integrating an HDFS cluster into the GCSS-MC system. However, we
believe that it does fully support the proof of concept that it can be done, and done
effectively, without disrupting the current system. During the setup and testing for this
thesis, the only stipulation that we found for the cluster to work effectively was that it
had to have IP connectivity to the Oracle database.

The architecture uses all open source technologies, which are readily available to
everyone. We chose to use Ubuntu 12.04 LTS as the operating system to run the Hadoop
cluster and Oracle 11g as the relational database to read from and write to because that is
what the GCSS-MC system is currently using for its database. The choices of operating
system and type of database can be changed and adapted to many other situations with a few
modifications. That is all of the software that was needed to run the experiment. There are
several other big data abstractions and tools that can be used, but we felt that keeping
the experiment to core HDFS was important to show this proof of concept and to limit the
amount of software needed to make the experiment functional.


3.3 Building a Hadoop Cluster 

We considered several possibilities for the best way forward to employ the Hadoop
cluster. The initial reaction was to simply run several virtual machines on a large server
to represent a cluster. But this thinking goes against the idea of using commodity machines
linked together to gain additive power [6]. So we decided to run a simple experiment on a
single Hadoop machine versus a virtualized two-node Hadoop cluster. The experiment ran
several benchmarks on the two different setups and found that the speedup was hampered by
the virtualized environment. Our conclusion was that no matter how many virtual machines
were running, ultimately they all had to read and write data to the same hard disk, which
causes a bottleneck. We found some other interesting factors that can contribute to slower
performance in a Hadoop virtual environment as well; the full paper can be found in the
Appendix. There has been a lot of research on optimizing Hadoop in the virtual computing
environment that has found tunable settings in Hadoop and the hypervisor that can give the
same performance as hardware [16].

Despite our findings, we decided to run a ten-node cluster on two servers. This was largely
decided due to the available resources and our aim to show a proof of concept (and not to
focus on performance). Once we decided on our way forward, we began to build the cluster
that would be used for the experiment. The cluster was built on top of two servers running
Ubuntu 12.04 LTS. The virtual environment that we chose is Oracle's VirtualBox. The
nodes were split between the two servers, with seven nodes on server one and three on
server two. The split was decided based on the available random access memory (RAM) on
each server.


Table 3.1: Experiment Server Specifications 


Specifications     Server #1                Server #2
OS                 Ubuntu 12.04 (64 Bit)    Ubuntu 12.04 (64 Bit)
CPU(s)             4                        4
CPU MHz            1500                     1600
HD TB              2                        1
RAM GB             32                       16
# of HDFS Nodes    7                        3


Once the servers and virtual environment were set up, we built the 10 virtual machines
(VMs) on the servers. The next step was to create and start the Hadoop nodes on
the VMs. The Hadoop build was done using Tom White's Hadoop: The Definitive Guide
[7] and Michael Noll's Hadoop Tutorials [17]. Each node was built and tested separately
prior to adding it to the cluster. This process could have been expedited by cloning the
machines; however, we wanted to ensure the integrity of each machine and the cluster.

Table 3.2: Experiment HDFS Node Specifications 


Specifications     HDFS Nodes
OS                 Ubuntu 12.04 (64 Bit)
CPU(s)             1
HD GB              100
RAM GB             4


The process of installing Hadoop on a VM is not all that different from installing any
other software package. The Hadoop install is downloaded as a compressed file (.tar.gz).
There are a few prerequisites to the Hadoop install. First, you must ensure that you have
an up-to-date version of Java (our testing found that Hadoop works with Java 6 or Java 7).
You must also configure SSH on the VM in order to allow Hadoop to communicate. Once this
is done, you have to edit several configuration files to set up the environment, and then
you are off and running. The configuration files are also used to control and optimize the
cluster. See the appendix for full installation instructions.
CHAPTER 4:
The Experiment 


This chapter focuses on two things: first, how the data is taken from the database and
loaded into the Hadoop ecosystem, and then how the data is used in two separate analytic
programs. The first program is a home-grown analytic tool that uses the GCSS-MC data to
find the 100 most frequent NSNs in the whole data set and then builds tables based on the
top 100 NSNs found. The second program is a simulated SQL program taken from a similar
project that is asking for SQL analytics on the same data set. The program runs on the data
and produces the same results that SQL would produce. The second program is done to show
that Hadoop can simulate anything done in an SQL environment. We believe that this
makes a strong case that Hadoop can be used to provide analytics on both structured and
unstructured data sets.

4.1 The GCSS-MC Data 

The GCSS-MC data that we received is a sample of actual production data. This data 
set was obtained from the USMC and used to assist the thesis experiment. The data was 
first loaded into an Oracle 11g database. From the database we are able to see fourteen
tables of GCSS-MC data. The first step in working with this structured data was to write a 
program to pull the data out of the relational database and move the data into the Hadoop 
ecosystem. There were several decisions made as to how to pull the data out and its format. 
The final choice was made to write a Java program to access the Oracle database using the 
Java database connect libraries and parse each table into separate files. The program was 
written to parse each table in the database into JavaScript Object Notation (JSON) format. 
A sample of the data in JSON format can be found in Figure 4.1. The NSNs are highlighted 
in the JSON formatted data. 

JSON format was chosen because it lends itself nicely to formatting between different 
databases. Also, a table can quickly be built from the format. This is important because 
the goal of this project is to pull all of the data from a relational database and run big data 
analytics and ultimately return back to a simplified relational database table. 
{
"XXMC_MERIT_RETAIL_INVENTORY": [
{"ACTIVITY_ADDRESS_CODE":"MMC100","BACK_ORDER_QUANTITY":"0","CONTROL_ITEM_CODE":"null",
"DAY30_USAGE_RD":"null","DUE_PROVISIONS":"0","DUE_STOCK":"0","EXCH":"0",
"FIXED_LEVEL":"0","FLOAT_REORDER":"25","FREEZE_CODE":"Y","FREEZE_REASON":"GHOST SN",
"FREEZE_DATE":"2013-05-01 23:24:47.0","GABF_DATE":"2013-12-08 11:31:33.0",
"LAST_TRANSACTION_DATE":"2013-10-15 13:24:01.0","MATERIAL_ID_CODE":"B","MOFFSET":"0",
"NON_SYSTEM_ID_CODE":"null","NO_1ST_RECEIPT":"null","OH_PROVISIONS":"0",
"OH_STOCK_SERVICEABLES":"40","OH_UNSERVICEABLES":"0","PHRASE_CODE":"4",
"PRIME_NSN":"4330010463399","NOMENCLATURE":"FILTER ELEMENT, FLUI","RECORD_NSN":"4330010463399",
"REORDER_DATE":"2013-12-05 06:07:28.0","REORDER_POINT":"25","REQUISITION_OBJECTIVE":"33",
"ROUTING_ADDRESS_CODE":"MMC300","ROUTING_IDENTIFIER_CODE":"MCI","SEC_CODE":"U",
"SPL_ALLOW":"0","STORE_ACCOUNT_CODE":"1","SUPPLY_SOURCE_CODE":"null","TOTAL_MO_ALLOW":"0",
"UNIT_OF_ISSUE":"EA","UNIT_PRICE":"7.34","PROCESS_STATUS":"Y","RECORD_ID":"73906115",
"CREATED_BY":"1277","CREATION_DATE":"2013-12-08 11:31:33.0","LAST_UPDATED_BY":"1277",
"LAST_UPDATE_DATE":"2013-12-08 11:31:33.0","REQUEST_ID":"27221076","BATCH_ID":"GENEP54431990",
"EXTERNAL_APPLICATION":"merit","REQUIREMENT_CODE":"IFUC","OPERATION_CODE":"61",
"IIP_QUANTITY":"0"},


Figure 4.1: Sample GCSS-MC data in JSON Format 


There are, however, some compromises that are made when using JSON format. When
JSON is used, there is going to be some increase in the size it takes to store the data. In
our case the Oracle database is approximately 1.25 gigabytes (GB), and when that data is
taken out of the database and parsed into JSON it becomes 5.3 GB. That is a data increase
factor of 4.24. This experiment can handle that level of storage increase, but that is not
the case as the data becomes larger than 1 TB. We discussed this at length and chose to move
forward with JSON with the understanding that this would have to change to increase the
scale of the data. The main reason for us to retain JSON is that keeping the metadata in the
data provides more flexibility in transforming and exporting the data dynamically.

Another option would have been to use a different format for the data that would reduce
the size requirements. For example, the data could have been pulled from the database and
parsed into Comma Separated Value (CSV) format. The CSV format would reduce the
data blow-up factor significantly. In the data set for this thesis, the increase factor for
CSV would be 1.176, which is 4.5 times less than JSON format. Additionally, the data could
be reduced by omitting fields for which the database has a null value. For this dataset we
did not remove the null values from the data. Again, the main reason to keep the null
values was to provide the most flexibility to the analytics in the future.

A great deal of time has to be spent on how to deal with the data when the possibility
exists for the data to exceed terabytes. The larger the data set, the more effort must be
spent in minimizing the size of the data, especially since the Hadoop ecosystem will
increase the storage requirement of any dataset by a factor of three. This is due to the
data replication factor set for Hadoop. The data replication factor is tunable, but if it is
reduced below three, the cluster will be more susceptible to faults, and programs may run
slower due to the larger network overhead of having to send the data to a node to run the
computation. The Hadoop JobTracker will try to schedule the computation on an idle node
where the data resides. Therefore, if the replication factor is decreased, the chance of
finding an idle node that has the data decreases. The JobTracker will then have to schedule
the computation on another node and send that node the data, resulting in the overhead of
additional scheduling messages as well as the actual sending of the data over the network.
Hadoop can handle large amounts of data, and the main limit on the size of the data Hadoop
can handle is the physical limitations of the machines on which it is deployed. As an
illustration of what is possible with a Hadoop cluster, in 2010 Facebook had a Hadoop
cluster of 2,000 nodes with a storage capacity of twenty-one PB [18].
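
The storage arithmetic in this section can be checked with a few lines of code. The sketch below uses only the sizes reported above (a 1.25 GB database, a 5.3 GB JSON export, and a replication factor of three); the class and method names are illustrative. It reproduces the increase factor of 4.24 and shows that the JSON export would occupy about 15.9 GB of raw HDFS capacity once replicated.

```java
/** Back-of-the-envelope storage arithmetic using the sizes reported in this section. */
final class StorageSketch {

    /** Ratio of the exported size to the size inside the source database. */
    static double blowUpFactor(double exportedGb, double databaseGb) {
        return exportedGb / databaseGb;
    }

    /** Raw HDFS capacity consumed once every block is stored replicationFactor times. */
    static double replicatedFootprintGb(double exportedGb, int replicationFactor) {
        return exportedGb * replicationFactor;
    }

    public static void main(String[] args) {
        double databaseGb = 1.25; // Oracle database size
        double jsonGb = 5.3;      // the same data exported as JSON
        System.out.printf("JSON increase factor: %.2f%n", blowUpFactor(jsonGb, databaseGb));
        System.out.printf("HDFS footprint at replication 3: %.1f GB%n",
                replicatedFootprintGb(jsonGb, 3));
    }
}
```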

Once all of the decisions were made on how to parse the GCSS-MC database, the next
step was importing the data into the actual Hadoop ecosystem. We chose to simply use the
Hadoop file manipulation commands to ingest the JSON feed into the Hadoop ecosystem.
The ingest is ultimately accomplished with a bash script executed on the NameNode.
The bash script executes the Java program to pull the data from the Oracle database and
then executes the Hadoop file system command to import the data. We experimented with
Sqoop to do this and were pleased with the results. Sqoop is an Apache Hadoop tool that
supports importing and exporting of database data into Hadoop. However, because Sqoop
did not support the JSON format we needed, we wrote a Java program that executes Java
Database Connectivity (JDBC) calls to the database and then generates the JSON format.
The program is designed to make the JDBC call to the database and create the JSON
format for each row on a separate line in a text file. Having each row of data in the
database represented by one line in the text file is important because of how Hadoop reads
the file in a MapReduce algorithm. For instance, if the program spread a row of data beyond
one line, it would be nearly impossible to write a MapReduce program to recreate the
database row. This is due to Hadoop MapReduce handling each line of input data
separately, which is extremely important to ensure the program can be split up and executed
in parallel. Of course, we could write another program to handle that and recreate the row,
but that would incur a large overhead cost. Because we wrote the program that pulls the
data, we have the power to control how the data is handled. We do not mean to say that
using Sqoop or another third-party tool is a bad thing, just that in building our program we
wanted to ensure we controlled the data at each stage, and writing our own program to
handle that was the best solution in our experiment. Additionally, we decided to try to
keep the smallest footprint of software that was needed to run this experiment. We
thought that if we could create all of the programs we needed, we would better understand
the data flow and thus better understand how to use the tools that abstract the lower-level
coding.
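
The one-row-per-line requirement can be illustrated with a small formatter. This is a sketch under stated assumptions rather than the program used in the experiment: JsonLineSketch, escape, and toJsonLine are hypothetical names, a LinkedHashMap stands in for a row read from a JDBC ResultSet so that the code runs without a database, and the escaping covers only the most common cases. Newlines inside a value are escaped so that a row can never spill onto a second line, and a missing value is written as the string "null", matching the sample in Figure 4.1.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Formats one database row as a single-line JSON object (hypothetical helper). */
final class JsonLineSketch {

    /** Escapes characters that would break a JSON string or split it across lines. */
    static String escape(String value) {
        StringBuilder out = new StringBuilder();
        for (char c : value.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case '\t': out.append("\\t");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    /** Builds {"COL":"value",...} on one line; a missing value becomes the string "null". */
    static String toJsonLine(Map<String, String> row) {
        StringJoiner json = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> column : row.entrySet()) {
            String value = column.getValue() == null ? "null" : column.getValue();
            json.add("\"" + escape(column.getKey()) + "\":\"" + escape(value) + "\"");
        }
        return json.toString();
    }

    public static void main(String[] args) {
        // In a real extractor each row would come from a JDBC ResultSet;
        // a LinkedHashMap stands in here so the sketch runs without a database.
        Map<String, String> row = new LinkedHashMap<>();
        row.put("PRIME_NSN", "4330010463399");
        row.put("NOMENCLATURE", "FILTER ELEMENT,FLUI");
        row.put("CONTROL_ITEM_CODE", null);
        System.out.println(toJsonLine(row));
    }
}
```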


4.2 Top-100-NSN Program

The first program we designed and developed is the NSN program. This program scans all
13 files pulled from the database and finds the 100 most frequently occurring NSNs.
Then it pulls the data associated with those NSNs and creates a SQL statement to write
the data back to the database. The idea of this program was to scan a large dataset and
then create smaller tables that contain only specifically defined, needed data. During the
creation of this program we made the choices as to what data to return. Although we made
educated guesses on what a Supply Officer would deem important, those choices are not
what matters in what we were able to show. The fact that the data is returned is what
is important. The choice of what subset of data is returned does not matter; that can be
changed in the program, and then whatever data a customer wants can be delivered.

In order to get the desired result, the program had to be written to execute several
MapReduce jobs chained together. In this program we had to write five separate MapReduce
algorithms. The first MapReduce algorithm simply counts the occurrences of NSNs. The
second algorithm sorts the NSNs in descending order of frequency. The next algorithm finds
all of the data associated with the NSNs. Then the final two algorithms create the JSON and
the SQL statements and write the data to the Oracle database. Each one of the algorithms
was individually created and tested. Once all of the algorithms were completed, we packaged
them up into a single MapReduce job to run.

The flow of the entire program is illustrated in Figure 4.2. The figure gives a graphic
representation of the data flow through the whole program by illustrating the data at each
algorithm. The inputs are on the left side of the algorithm boxes and the outputs on the
right side. The inputs at the top of a box are used by the algorithm in the configure
method of the map phase.
4.2.1 Top-100-NSN Program: First Algorithm

The first MapReduce algorithm in the NSN program finds and counts all of the NSNs in 
the entire corpus. This algorithm is similar to a word count algorithm. 


Input
All Database Tables (JSON records in the format shown in Figure 4.1)

Output
Key/Value: NSN, Frequency
Ascending Order

1240015251648 645
6140014851472 630
1005013832872 579
5855014320524 543
1005012310973 465

Figure 4.3: NSN: First Algorithm 


The major difference is that the "word" that is counted is limited to an NSN. The map phase
scans all thirteen tables and outputs only valid NSNs. The reduce phase then sums up all of
the occurrences of each NSN and produces the output. This step is the first in the analysis
process and is needed to find the most frequent NSNs in the data corpus. The data input to
the algorithm is similar to the data in Figure 4.1, and the highlighted values are the NSNs
we are searching for in this algorithm. A sample of the output is shown in Figure 4.4.


1240015251648 645
6140014851472 630
1005013832872 579
5855014320524 543
1005012310973 465


Figure 4.4: NSN Output from the First Algorithm 
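
The counting step of the first algorithm can be imitated in plain Java. The sketch below is not the MapReduce code from the experiment. It assumes, for illustration only, that a valid NSN is the 13-digit value of any field whose name ends in "NSN" (such as PRIME_NSN or RECORD_NSN in Figure 4.1); the class name, the regular expression, and the count method are hypothetical. Under that assumption a record that carries the same NSN in two fields contributes two occurrences.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Plain-Java imitation of the first algorithm: count NSN occurrences in JSON lines. */
final class NsnCountSketch {

    // Assumption: an NSN is the 13-digit value of a field whose name ends in "NSN".
    private static final Pattern NSN_FIELD = Pattern.compile("\"\\w*NSN\":\"(\\d{13})\"");

    /** Map and reduce in one pass: find each NSN in a line and sum the occurrences per NSN. */
    static Map<String, Integer> count(List<String> jsonLines) {
        Map<String, Integer> frequency = new TreeMap<>();
        for (String line : jsonLines) {
            Matcher matcher = NSN_FIELD.matcher(line);
            while (matcher.find()) {
                frequency.merge(matcher.group(1), 1, Integer::sum);
            }
        }
        return frequency;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "{\"PRIME_NSN\":\"4330010463399\",\"RECORD_NSN\":\"4330010463399\"}",
                "{\"PRIME_NSN\":\"1240015251648\",\"UNIT_PRICE\":\"7.34\"}");
        count(lines).forEach((nsn, n) -> System.out.println(nsn + "\t" + n));
    }
}
```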


4.2.2 Top-100-NSN Program: Second Algorithm

The next step is to sort the NSNs by frequency. Hadoop will always provide some order on
the output from the reduce phase. The order is based on the key of the key/value pair.
Hadoop, by default, will order the output of the reduce phase in ascending alphabetic and
numeric order.

In this algorithm we want the reduce phase to produce an output that is ordered by the
frequency of NSN occurrences in descending order. This has to be done in a separate
algorithm from the first. If we attempt to order in the first algorithm before it completes,
then the sum of the NSN occurrences will not be accurately produced. The output of the
first algorithm therefore becomes the input to the second algorithm. There are two major
obstacles to overcome to produce the desired output. First, we have to override the output
key comparator class to produce key output in descending order. This process occurs in
the combining and shuffling steps of Hadoop. The code created to perform this descending
order sort is shown in Figure 4.6. The method overrides the compare method that Hadoop
calls. It delegates to the standard Text comparator and reverses the order by multiplying
the result by negative one.

The second obstacle is that we need to transpose the key/value pairs such that the frequency
value of an NSN is now the key and the NSN becomes the value. This is done to produce an
output that will be ordered in descending order based on the frequency of occurrence of a
particular NSN. The transposing of the key/value pair is accomplished in the map phase.
Then the shuffling and combining phase will produce the input to the reduce phase with the
NSN frequency as the key and the NSN itself as the value. A sample of the output of the
algorithm can be seen in Figure 4.7.


Input
Key/Value: NSN, Frequency
Ascending Order

1240015251648 645
6140014851472 630
1005013832872 579
5855014320524 543
1005012310973 465

Algorithm 2

Output
Key/Value: NSN, Frequency
Descending Order

645 1240015251648
630 6140014851472
579 1005013832872
543 5855014320524
465 1005012310973

Figure 4.5: NSN: Second Algorithm

// Requires org.apache.hadoop.io.Text and org.apache.hadoop.io.WritableComparator.
static class ReverseComparator extends WritableComparator {

    private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();

    public ReverseComparator() {
        super(Text.class);
    }

    // Compare the serialized Text keys, then negate the result to reverse the order.
    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return (-1) * TEXT_COMPARATOR.compare(b1, s1, l1, b2, s2, l2);
    }
}



Figure 4.6: Descending Order Sort Class 


645 1240015251648 
630 6140014851472
579 1005013832872
543 5855014320524 
465 1005012310973 

Figure 4.7: NSN Output from the Second Algorithm 
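
The two steps of the second algorithm, swapping the key and value and then sorting in descending order, can be sketched in plain Java as follows. The names FrequencySortSketch and sortDescending are illustrative and are not part of Hadoop. One deliberate difference from Figure 4.6 is that this sketch compares the frequencies as integers rather than as text; this is an assumption of the sketch, chosen because a text comparison would rank "99" above "645".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Plain-Java imitation of the second algorithm: swap key and value, sort descending. */
final class FrequencySortSketch {

    /** Swaps (NSN, frequency) into (frequency, NSN) and orders by frequency, highest first. */
    static List<Map.Entry<Integer, String>> sortDescending(Map<String, Integer> nsnCounts) {
        List<Map.Entry<Integer, String>> swapped = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : nsnCounts.entrySet()) {
            swapped.add(Map.entry(entry.getValue(), entry.getKey()));
        }
        // Reverse the natural (ascending) order of the integer keys.
        swapped.sort(Map.Entry.<Integer, String>comparingByKey().reversed());
        return swapped;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of(
                "1005012310973", 465, "1240015251648", 645, "6140014851472", 630);
        for (Map.Entry<Integer, String> row : sortDescending(counts)) {
            System.out.println(row.getKey() + "\t" + row.getValue());
        }
    }
}
```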


4.2.3 Top-100-NSN Program: Third Algorithm

The output from the second algorithm enables us to search and find all of the data
associated with the 100 most frequently occurring NSNs. In order to facilitate using the
output from the second algorithm, we used the distributed cache functionality that is part
of Hadoop. The distributed cache allows the programmer to access other files on the HDFS.
Then we had to decide what type of data structure to build from the output of algorithm two.
After some initial testing we found that a hash map provided faster comparisons than a map
or a pattern. Hadoop also provides the configure method as a means to set up data
structures that will be needed in the map phase. This is critical because each map instance
has to build the data structure in exactly the same way in order to achieve predictable,
repeatable results. In this particular configure method we create a hash map with the first
100 NSNs from the output of algorithm two.
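
The lookup structure described above, and the way a map step might consult it, can be sketched in plain Java. This is an illustration rather than the code from the experiment: TopNsnLookupSketch, configure, and match are hypothetical names, the sorted output of algorithm two is passed in as a list of "frequency NSN" lines instead of being read from the distributed cache, and an NSN is again assumed to be the 13-digit value of a field whose name ends in "NSN".

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Plain-Java imitation of the configure-time lookup table and the map-side filter. */
final class TopNsnLookupSketch {

    // Assumption: an NSN is the 13-digit value of a field whose name ends in "NSN".
    private static final Pattern NSN_FIELD = Pattern.compile("\"\\w*NSN\":\"(\\d{13})\"");

    /** Configure step: read "frequency NSN" lines, already sorted, and keep the first limit lines. */
    static Map<String, Integer> configure(List<String> sortedLines, int limit) {
        Map<String, Integer> top = new HashMap<>();
        for (String line : sortedLines.subList(0, Math.min(limit, sortedLines.size()))) {
            String[] parts = line.trim().split("\\s+");
            top.put(parts[1], Integer.parseInt(parts[0]));
        }
        return top;
    }

    /** Map step: return the first NSN in the record that is in the lookup table, if any. */
    static Optional<String> match(String jsonLine, Map<String, Integer> top) {
        Matcher matcher = NSN_FIELD.matcher(jsonLine);
        while (matcher.find()) {
            if (top.containsKey(matcher.group(1))) {
                return Optional.of(matcher.group(1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, Integer> top = configure(
                List.of("645 1240015251648", "630 6140014851472", "579 1005013832872"), 2);
        String record = "{\"PRIME_NSN\":\"6140014851472\",\"UNIT_PRICE\":\"7.34\"}";
        // A matching record is emitted with the NSN in front, as the key.
        match(record, top).ifPresent(nsn -> System.out.println(nsn + "\t" + record));
    }
}
```

Because every map instance builds the table from the same sorted file and the same limit, each instance makes the same keep-or-skip decision for a given record.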


Input
All Database Tables (JSON records in the format shown in Figure 4.1)

Configure
Key/Value: NSN, Frequency
Descending Order

645 1240015251648
630 6140014851472
579 1005013832872
543 5855014320524
465 1005012310973

Output
Key/Value: NSN, JSON where NSN occurs

1005011182640 {"UNIT_NAME":"null","TAMCN":"null" ... }
1005011182640 {"UNIT_NAME":"null","TAMCN":"null" ... }
1005011182640 {"UNIT_NAME":"null","TAMCN":"null" ... }
1005011182640 {"UNIT_NAME":"null","TAMCN":"null" ... }
1005011182640 {"UNIT_NAME":"null","TAMCN":"null" ... }

Figure 4.8: NSN: Third Algorithm 


The map phase then scans all of the data in the corpus and produces an output if and only
if the NSN is found in the line of data. The line is output in JSON format with the NSN
and the source file appended to the front of the JSON. The reduce phase of this algorithm
is the identity reducer. It simply outputs the key/value pairs it receives from the map
phase. A sample of the output can be found in Figure 4.9.
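
The filtering step of the map phase can be sketched as follows. This is a simplified
stand-in rather than the thesis implementation: the class NsnFilter is hypothetical, an
NSN is assumed to appear in a record as a quoted 13-digit value, and the tab-separated
output layout (NSN, source file, JSON) is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the third algorithm's map logic in plain Java.
class NsnFilter {

    // Assumption: an NSN is stored as a quoted 13-digit value in the JSON record.
    private static final Pattern NSN_VALUE = Pattern.compile("\"(\\d{13})\"");

    // Returns one "NSN<TAB>sourceFile<TAB>json" line for each distinct top NSN
    // found in the record, or an empty list when the record contains none.
    static List<String> map(String sourceFile, String jsonLine, Set<String> topNsns) {
        List<String> output = new ArrayList<>();
        Set<String> alreadyEmitted = new HashSet<>();
        Matcher matcher = NSN_VALUE.matcher(jsonLine);
        while (matcher.find()) {
            String nsn = matcher.group(1);
            if (topNsns.contains(nsn) && alreadyEmitted.add(nsn)) {
                output.add(nsn + "\t" + sourceFile + "\t" + jsonLine);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        Set<String> top = new HashSet<>(Arrays.asList("1005011182640"));
        String line = "{\"UNIT_NAME\":\"null\",\"NSN\":\"1005011182640\"}";
        System.out.println(map("XXMC_R001_ALLOWANCES_TBL.TXT", line, top));
    }
}
```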


1005011182640 {"UNIT_NAME":"null","TAMCN":"null" ... }
1005011182640 {"UNIT_NAME":"null","TAMCN":"null" ... }
1005011182640 {"UNIT_NAME":"null","TAMCN":"null" ... }
1005011182640 {"UNIT_NAME":"null","TAMCN":"null" ... }
1005011182640 {"UNIT_NAME":"null","TAMCN":"null" ... }

Figure 4.9: NSN Output from the Third Algorithm


4.2.4 Top-100-NSN Program: Fourth Algorithm

The next step is to take the output from the third algorithm and parse it down to the
JSON that we wish to write back to the database.

This algorithm and the next one could be combined into a single algorithm that generates
the JSON and creates the SQL statement to write to the database using a JDBC call.
Ultimately, we decided to keep them separate so that we could use the JSON format in the
future to regenerate the tables, if necessary. Furthermore, to maintain the flexibility to
adapt to other tools, such as a database Hadoop connector, we felt it prudent to keep the
JSON formatting code. A sample of the output of the algorithm can be seen in Figure
4.11.
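
The parsing step can be illustrated with a small field projection. The sketch below is an
assumption-laden example, not the thesis code: the class JsonProjector is hypothetical, it
only handles flat records whose values are quoted strings without escaped quotes, and it
emits standard JSON with a SOURCE_TBL key (a column name borrowed from the insert
statements in Figure 4.12) rather than the exact layout shown in Figure 4.11.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: keep only selected fields of a flat JSON record.
class JsonProjector {

    // Builds a new JSON object containing the NSN, the source table, and the
    // requested fields. A field missing from the record is written as "null".
    static String project(String nsn, String sourceTable, String json, List<String> fields) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"NSN\":\"").append(nsn)
          .append("\",\"SOURCE_TBL\":\"").append(sourceTable).append('"');
        for (String field : fields) {
            // Matches "FIELD":"value" for a quoted value with no embedded quotes.
            Pattern p = Pattern.compile("\"" + Pattern.quote(field) + "\"\\s*:\\s*\"([^\"]*)\"");
            Matcher m = p.matcher(json);
            String value = m.find() ? m.group(1) : "null";
            sb.append(",\"").append(field).append("\":\"").append(value).append('"');
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        String record = "{\"UNIT_NAME\":\"null\",\"TAMCN\":\"E14422M\",\"OTHER\":\"x\"}";
        System.out.println(project("1005011182640", "XXMC_R001_ALLOWANCES_TBL",
                record, Arrays.asList("UNIT_NAME", "TAMCN")));
    }
}
```

A production job would more likely use a real JSON parser; the regular expression is used
here only to keep the example free of external libraries.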

At this point we made some decisions about what data to write back. Table 4.1 illustrates
the data that we write back to the database. The data choice can be modified and tailored
to exactly what a customer or stakeholder desires. Our choices were made only to show, as
a proof of concept, that the data can be found and that a smaller, desired subset can be
written back to a database. However, any of the data that exists in the original database
can be retrieved with only minor modifications to the code base. The benefit of this type
of analytic is truly realized as the data size increases: the customer can keep all of
their data and produce tables of just the pertinent data required for a particular need.

Input: Key/Value: NSN, JSON where NSN occurs
Output: Key/Value: NSN, New JSON Formatted to new output table

Figure 4.10: NSN: Fourth Algorithm


{"NSN":"1005011182640" Table Name: XXMC_R001_ALLOWANCES_TBL.TXT ,"UNIT_NAME":"NULL" ... }
{"NSN":"1005011182640" Table Name: XXMC_R001_ALLOWANCES_TBL.TXT ,"UNIT_NAME":"NULL" ... }
{"NSN":"1005011182640" Table Name: XXMC_R001_DUEIN_TBL.TXT ,"ORDER_NUMBER":"1041015" ... }
{"NSN":"1005011182640" Table Name: XXMC_R001_DUEIN_TBL.TXT ,"ORDER_NUMBER":"1596718" ... }
{"NSN":"1005011182640" Table Name: XXMC_R001_DUEIN_TBL.TXT ,"ORDER_NUMBER":"616556" ... }

Figure 4.11: NSN Output from the Fourth Algorithm

Table 4.1: Tables Written Back to Database

4.2.5 Top-100-NSN Program: Fifth Algorithm

The final algorithm for this program performs the MapReduce function that makes the
JDBC call to write the data back to the database. This algorithm takes the formatted JSON
output from the previous algorithm and creates an SQL statement to write the data to the
database.


Input: Key/Value: NSN, New JSON Formatted to new output table
Output: Key/Value: NSN, SQL Insert Statement

Figure 4.12: NSN: Fifth Algorithm


The JDBC connection is made in the configure method of the map phase. The map phase
then creates the statement and makes the JDBC call to execute the statement. The output of
the map class is the NSN and the SQL statement. The reduce phase is the identity reducer
and simply outputs what it receives. The reduce phase could be eliminated and the output
taken directly from the map phase, which would speed up the algorithm execution. We
decided to keep the reduce phase to sort the map output. A sample output of the algorithm
can be seen in Figure 4.13.
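
The write-back step can be sketched with standard JDBC calls. The example below is a
hedged illustration, not the thesis code: the class InsertWriter and its methods are
hypothetical, and it uses a parameterized PreparedStatement, which is a substitution made
here for safety and may differ from how the thesis assembles its SQL text. The connection
is assumed to have been opened elsewhere, for example in the configure method.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the JDBC write-back performed in the map phase.
class InsertWriter {

    // Builds "INSERT INTO table (c1, c2) VALUES (?, ?)". Table and column names
    // cannot be bound as parameters, so they must come from trusted code.
    static String buildInsertSql(String table, List<String> columns) {
        String cols = String.join(", ", columns);
        String marks = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "INSERT INTO " + table + " (" + cols + ") VALUES (" + marks + ")";
    }

    // Binds one value per column and executes the insert on an open connection.
    // Returns the number of rows written.
    static int write(Connection conn, String table, List<String> columns,
                     List<String> values) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(buildInsertSql(table, columns))) {
            for (int i = 0; i < values.size(); i++) {
                ps.setString(i + 1, values.get(i)); // JDBC parameters are 1-based
            }
            return ps.executeUpdate();
        }
    }

    public static void main(String[] args) {
        System.out.println(buildInsertSql("HDFS_R001_ALLOWANCES_NSN",
                Arrays.asList("NSN", "SOURCE_TBL")));
    }
}
```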


6135009857845 INSERT INTO HDFS_R001_ALLOWANCES_NSN (NSN, SOURCE_TBL,
6135009857845 INSERT INTO HDFS_R001_ALLOWANCES_NSN (NSN, SOURCE_TBL,
6135009857845 INSERT INTO HDFS_R001_ALLOWANCES_NSN (NSN, SOURCE_TBL,
6135009857845 INSERT INTO HDFS_R001_ALLOWANCES_NSN (NSN, SOURCE_TBL,
6135009857845 INSERT INTO HDFS_R001_ALLOWANCES_NSN (NSN, SOURCE_TBL,

Figure 4.13: NSN Output from the Fifth Algorithm


4.2.6 Top-100-NSN Program Conclusion

The result of the entire NSN program is the creation of eleven new tables in the database
with an aggregated 916,190 rows. Table 4.2 shows the first ten rows of data from the
HDFS_R001_INVENTORY_NSN table created by the NSN program. The power of this type
of analytic is that it lets the customer analyze all of their data and filter it down to
only the data they are concerned with at the time. This creates the flexibility to
maintain and analyze the entire data set and also reduce it to manageable subsets of
needed data. This program does just that: it scans all of the data and then returns to the
database only the data for the most frequent NSNs, with a reduced subset of the original
data. Although this program writes a subset of the data back to the database, it could
just as easily reduce the data further into fewer tables by joining the data from any or
all of the tables. The possibility of the analytics Hadoop can perform is limitless.


Table 4.2: Sample of HDFS_R001_INVENTORY_NSN Table

NSN            Source Table Name        RECORD_ID  STATUS_CODE  SERIAL_NUM  TAMCN
1005013832872  XXMC_R001_INVENTORY_TBL  441912925  LATEST       10328808    E14422M
1005013832872  XXMC_R001_INVENTORY_TBL  441912926  LATEST       10328831    E14422M
1005013832872  XXMC_R001_INVENTORY_TBL  441912927  LATEST       10328995    E14422M
1005013832872  XXMC_R001_INVENTORY_TBL  441912928  CREATED      10329091    E14422M
1005013832872  XXMC_R001_INVENTORY_TBL  441912929  LATEST       10329141    E14422M
1005013832872  XXMC_R001_INVENTORY_TBL  441912930  LATEST       10329204    E14422M
1005013832872  XXMC_R001_INVENTORY_TBL  441912931  LATEST       10329312    E14422M
1005013832872  XXMC_R001_INVENTORY_TBL  441912932  LATEST       10329315    E14422M
1005013832872  XXMC_R001_INVENTORY_TBL  441912933  LATEST       10329316    E14422M
1005013832872  XXMC_R001_INVENTORY_TBL  441912934  LATEST       10329351    E14422M

4.3 SQL Simulation

The SQL simulation program is an analytic taken from another ongoing Naval Postgraduate
School project that is examining the same GCSS-MC data. We felt it prudent to create an
analytic for something the USMC is currently looking for. The other project focuses
on using SQL to generate analytics and then displaying the results in a web tier
architecture. Although we do not take the results to a display in the web tier, we do
produce the same results and write them back to the database. The overarching idea is that
the HDFS resources are used to process the data and then something like a web tier
business analytic suite can display the table in a graph. The total program, which we call
Alpha, consists of five MapReduce algorithms. The purpose of the analytic is to produce a
readiness report aggregated over time, equipment, and/or unit. Figure 4.14 illustrates the
data relationship and the logic used in the Alpha program.



Figure 4.14: Alpha Program Logic 


The flow of the entire program is illustrated in Figure 4.15. The figure gives a graphic
representation of the data flow through the whole program by illustrating the data at each
algorithm. The inputs are on the left side of the algorithm boxes and the outputs on the
right side. The inputs to the top of a box are used by the algorithm in the configure
method of the map phase.

The SQL analytic consists of data calculations and three table joins. Accomplishing this
in Hadoop requires five MapReduce steps. The first step counts all NSNs that have an
Operational Status value of "Deadlined" in the XXMC_R001_SRHEADERS_TBL. The next
step scans the XXMC_R001_ITEMMASTER_TBL and gets all of the NSNs that are "MARES"
reportable. The third step scans the XXMC_R001_INVENTORY_TBL and calculates the
number of "Onhand" items for each NSN. The fourth step joins the outputs from steps two
and three and calculates the percentage of "Deadlined" versus "Onhand". The final step
writes the data back to the database table.

4.3.1 SQL Simulation: First Algorithm 

The first MapReduce algorithm in the Alpha program scans the
XXMC_R001_SRHEADERS_TBL. The map phase parses the data and finds all of the
rows that have an Operational Status of "Deadlined". When the "Deadlined" value is seen
in a row, the NSN associated with that row is recorded and the map phase outputs the
key/value pair of the NSN and the value one. The SRHEADERS table has an entry for each
item that has been "Deadlined". Therefore, there may be several "Deadlined" items
associated with a particular NSN, and we need the total number "Deadlined" for each NSN.

The reduce phase sums all of the "Deadlined" values for each NSN. The output from the
first algorithm is the NSN and the total number of items "Deadlined" for that particular
NSN. A sample of the output of the algorithm can be seen in Figure
4.17.
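
The counting logic of the first algorithm can be modeled in a few lines of plain Java.
This is a sketch under assumptions rather than the thesis code: the class DeadlinedCount
is hypothetical, each row is assumed to have already been parsed into an NSN and an
Operational Status, and the map and reduce phases are collapsed into one in-memory loop.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of the first Alpha algorithm.
class DeadlinedCount {

    // Each row is {nsn, operationalStatus}; this pre-parsed shape is an assumption.
    static Map<String, Integer> count(List<String[]> rows) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String[] row : rows) {
            if ("Deadlined".equals(row[1])) {
                // Map phase: emit (NSN, 1). Reduce phase: sum the ones per NSN.
                totals.merge(row[0], 1, Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"1005007265636", "Deadlined"},
                new String[] {"1005007265636", "Deadlined"},
                new String[] {"1005009573893", "Deadlined"});
        System.out.println(count(rows)); // prints {1005007265636=2, 1005009573893=1}
    }
}
```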

Input: SRHEADERS Table (JSON records)
Output: Key/Value: NSN, Number "Deadlined"

Figure 4.16: Alpha: First Algorithm

4.3.2 SQL Simulation: Second Algorithm

The second MapReduce algorithm takes the XXMC_R001_ITEMMASTER_TBL as input.
The algorithm also accesses the output from the first algorithm through the distributed
cache (Figure 4.15); the configure method uses that output to build a hash map with the
NSN as the key and the number "Deadlined" as the value. The map phase then scans the
ITEMMASTER table and checks all NSNs to determine the "MARES" status. If the NSN is
"MARES" reportable, the algorithm checks whether the NSN is a key in the hash map. If it
is, the number "Deadlined" is obtained from the hash map and the map phase sets the output
key/value pair to NSN/number "Deadlined"_"MARES".

The reduce phase in this case is the identity reducer. A sample of the output of the
algorithm can be seen in Figure 4.19.
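
The map logic of the second algorithm can be sketched as a single lookup. The class
MaresMapper below is hypothetical and is not the thesis code. Two assumptions are worth
flagging: the value is written as the count followed by "_MARES", matching the sample
output in Figure 4.19, and an NSN that is "MARES" reportable but absent from the hash map
is given a count of zero, which is inferred from the "0_MARES" entries in that figure
rather than stated in the text.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the second Alpha algorithm's map logic.
class MaresMapper {

    // Returns "<numberDeadlined>_MARES" for a reportable NSN, or null when the
    // item is not "MARES" reportable and should produce no output.
    static String map(String nsn, boolean maresReportable, Map<String, Integer> deadlined) {
        if (!maresReportable) {
            return null;
        }
        // Assumption: an NSN with no "Deadlined" entries is reported with a zero.
        int count = deadlined.getOrDefault(nsn, 0);
        return count + "_MARES";
    }

    public static void main(String[] args) {
        Map<String, Integer> deadlined = new HashMap<>();
        deadlined.put("1005007265636", 187);
        System.out.println(map("1005007265636", true, deadlined)); // prints 187_MARES
        System.out.println(map("1005013592714", true, deadlined)); // prints 0_MARES
    }
}
```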

1005007265636 187
1005009573893 9
1005010258095 21
1005010351674 1
1005011055191 1

Figure 4.17: Alpha Output from the First Algorithm


4.3.3 SQL Simulation: Third Algorithm 

The next MapReduce algorithm takes the XXMC_R001_INVENTORY_TBL as input. The
algorithm uses the output from the second algorithm to create a hash map in the configure
method. The hash map stores the NSN as the key, and the value is a string that contains
the "MARES" category and the number "Deadlined". The map phase scans the INVENTORY
table and checks whether the NSN is in the hash map; if it is, the map phase gets the
value for the quantity "Onhand". The map output is the key/value pair NSN/number "Onhand".

The reduce phase calculates the total number "Onhand" for each NSN. A sample output
of the algorithm can be seen in Figure 4.21.

Input: ITEMMASTER Table (JSON records)
Configure: Key/Value: NSN, Number "Deadlined"
Output: Key/Value: NSN, Number "Deadlined"_"MARES" Category

Figure 4.18: Alpha: Second Algorithm

4.3.4 SQL Simulation: Fourth Algorithm

The fourth MapReduce algorithm takes the output from the second algorithm as input and
the output from the third algorithm as distributed cache. The configure method builds a
hash map from the third algorithm's output. The map phase then performs a map-side join
on the output from the previous two algorithms: the input data is parsed and the values
from the hash map are appended. The algorithm is also responsible for calculating the
percentage of "Deadlined" items relative to the number "Onhand". The map output is the
key/value pair NSN/number "Deadlined"_"MARES" Category_number "Onhand"_Percentage.

The reduce phase is the identity reducer. A sample of the output of the algorithm can be
seen in Figure 4.23.
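
The map-side join and the percentage calculation can be sketched as follows. The class
DeadlinedJoin is hypothetical and is not the thesis code; it assumes the input value has
the form "<numberDeadlined>_MARES" and that the "Onhand" totals have already been loaded
into a hash map. Skipping NSNs with no "Onhand" total (or a total of zero) is a choice
made here to avoid division by zero.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the fourth Alpha algorithm's map-side join.
class DeadlinedJoin {

    // value is "<numberDeadlined>_MARES"; onHand maps NSN -> total "Onhand".
    // Returns "<numberDeadlined>_MARES_<onHand>_<percentage>", or null when the
    // NSN has no usable "Onhand" total to join against.
    static String join(String nsn, String value, Map<String, Integer> onHand) {
        Integer total = onHand.get(nsn);
        if (total == null || total == 0) {
            return null;
        }
        int deadlined = Integer.parseInt(value.substring(0, value.indexOf('_')));
        double percentage = 100.0 * deadlined / total;
        return String.format(Locale.ROOT, "%s_%d_%.2f", value, total, percentage);
    }

    public static void main(String[] args) {
        Map<String, Integer> onHand = new HashMap<>();
        onHand.put("1005007265636", 3880);
        // 187 deadlined out of 3880 on hand is 4.82 percent.
        System.out.println(join("1005007265636", "187_MARES", onHand)); // prints 187_MARES_3880_4.82
    }
}
```

The printed value matches the first row of the sample output in Figure 4.23.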

4.3.5 SQL Simulation: Fifth Algorithm 

The final MapReduce algorithm takes the output from the fourth algorithm as input.
The configure method sets up the JDBC connection to the database. The map phase parses
the input, then creates and executes a SQL insert statement for each line of input. The
final output of the algorithm is the key/value pair of NSN/SQL statement. A sample output
of the algorithm can be seen in Figure 4.25.

1005007265636 187_MARES
1005009573893 9_MARES
1005010351674 1_MARES
1005013592714 0_MARES
1005014123129 172_MARES

Figure 4.19: Alpha Output from the Second Algorithm


1005007265636 187
1005009573893 9
1005010258095 21
1005010351674 1
1005011055191 1

Figure 4.25: Alpha Output from the Fifth Algorithm


Input: INVENTORY Table (JSON records)
Configure: Key/Value: NSN, Number "Deadlined"_"MARES" Category
Output: Key/Value: NSN, number "Onhand"

Figure 4.20: Alpha: Third Algorithm

4.3.6 Alpha Program Conclusion

The result of the entire program is a join across three tables, calculations, and a
database table that contains all of the resulting data. A sample of the resultant Alpha
table is in Table 4.3. The purpose of this analytic was to show that Hadoop can perform
the same type of analysis of structured data that SQL can. This will become increasingly
important as the data set grows beyond 1 TB. The methodology can be adapted to perform all
types of SQL statements with no limitation. One thing the Alpha program does not show is
the ability to use multiple inputs to perform reduce-side joins. We purposely showed
map-side joins because we felt it was easier to demonstrate the data flow with them.
However, the number of algorithms can be reduced by using reduce-side joins. A small
speedup can also be achieved with a reduce-side join, but it is not significant
because all of the data still needs to be scanned the same amount.
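
For comparison, the idea behind a reduce-side join can be sketched in plain Java. This
example was not part of the Alpha program and every name in it is hypothetical. Records
from both sources are tagged with their origin ("D:" for "Deadlined", "O:" for "Onhand"),
grouped by NSN as the shuffle would group them, and combined in a single reduce step, so
the separate join algorithm is no longer needed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of a reduce-side join keyed on NSN.
class ReduceSideJoin {

    static Map<String, String> join(Map<String, Integer> deadlined, Map<String, Integer> onHand) {
        // "Map" and "shuffle": tag each value with its source and group by NSN.
        Map<String, List<String>> grouped = new TreeMap<>();
        deadlined.forEach((nsn, n) ->
                grouped.computeIfAbsent(nsn, k -> new ArrayList<>()).add("D:" + n));
        onHand.forEach((nsn, n) ->
                grouped.computeIfAbsent(nsn, k -> new ArrayList<>()).add("O:" + n));

        // "Reduce": combine the tagged values for each NSN into one output row.
        Map<String, String> out = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            int d = 0;
            int o = 0;
            for (String tagged : entry.getValue()) {
                int n = Integer.parseInt(tagged.substring(2));
                if (tagged.startsWith("D:")) {
                    d += n;
                } else {
                    o += n;
                }
            }
            if (o > 0) { // skip NSNs with nothing on hand to avoid dividing by zero
                out.put(entry.getKey(),
                        String.format(Locale.ROOT, "%d_%d_%.2f", d, o, 100.0 * d / o));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> deadlined = new HashMap<>();
        deadlined.put("1005007265636", 187);
        Map<String, Integer> onHand = new HashMap<>();
        onHand.put("1005007265636", 3880);
        System.out.println(join(deadlined, onHand)); // prints {1005007265636=187_3880_4.82}
    }
}
```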

1005007265636 3880
1005009573893 497
1005010351674 4
1005013592714 59
1005014123129 9108

Figure 4.21: Alpha Output from the Third Algorithm


Table 4.3: Sample of HDFS_ALPHA Table

NSN            NUMBER_DEADLINED  MARES_CATEGORY  QUANTITY_ONHAND  PERCENTAGE
3895014538573  4                 MARES           24               16.67
3895015390585  1                 MARES           22               4.55
3895015508369  0                 MARES           25               0.00
3895015733847  1                 MARES           1                100.00
3930014783519  54                MARES           389              13.88
3930014862151  1                 MARES           6                16.67
3930015080886  58                MARES           541              10.72
3930015227364  6                 MARES           106              5.66
3930015330855  22                MARES           187              11.76
3930015735873  2                 MARES           5                40.00

Input: Key/Value: NSN, Number "Deadlined"_"MARES" Category
Configure: Key/Value: NSN, number "Onhand"
Output: Key/Value: NSN, Number "Deadlined"_"MARES" Category_number "Onhand"_Percentage

Figure 4.22: Alpha: Fourth Algorithm


1005007265636 187_MARES_3880_4.82
1005009573893 9_MARES_497_1.81
1005010351674 1_MARES_4_25.00
1005013592714 0_MARES_59_0.00
1005014123129 172_MARES_9108_1.89

Figure 4.23: Alpha Output from the Fourth Algorithm

Input: Key/Value: NSN, Number "Deadlined"_"MARES" Category_number "Onhand"_Percentage
Output: Key/Value: NSN, SQL Insert Statement

Figure 4.24: Alpha: Fifth Algorithm

CHAPTER 5:
Conclusion 


The amount of data that is collected and stored continues to increase every day. The
International Data Corporation (IDC) estimates that by the end of 2013 stored data will
total 2.7 zettabytes (ZB), a forty-eight percent increase from 2011 [19]. The USMC
is no different from the commercial world. The size of data stored in USMC databases is
growing, with the current size of the GCSS-MC database at six TB. There is a need to
find a solution to manage data storage increases in the USMC. This thesis demonstrated
that big data analytics could be added to the current GCSS-MC architecture, giving the
USMC the power to use all of its data in developing analytics. The remainder of this
chapter further explains how the research questions have been answered and covers future
work.

5.1 The Outcome 

Processing GCSS-MC data in Hadoop is promising. The thesis shows examples of the
analytics that can be run in the Hadoop ecosystem on the GCSS-MC data. This work
indicates that a Hadoop cluster can handle the analytics that are needed now and is
flexible enough to allow for the programming of additional analytical needs. With
continued effort and exploration, big data analytics could bring function and simplicity
to a large, complicated data set.

The research questions posed in Chapter 1 guided the overall proof of concept behind
adding big data analytics to GCSS-MC. We determined that a big data element can be added
to the GCSS-MC system and that it can be used to provide data analytics within the
GCSS-MC system. After completing the research and examining the results, we found that
some of the questions were broad or vague. We decided to pursue this research using HDFS
and as small a software footprint as possible. This thesis shows that the concept of
adding Hadoop to the GCSS-MC system is achievable. However, more work is needed to show
how Hadoop could be integrated into a system that more closely resembles a production
GCSS-MC system. The rest of this section reiterates the research questions and indicates
where the details of the research can be found within the thesis.

The first research question is: What would an architecture look like that adds a big data
element to GCSS-MC? This question is very broad and can be approached in several ways.
In this thesis we decided to use HDFS as the big data element to add to the GCSS-MC
system. Furthermore, we built an HDFS cluster and showed how it could benefit the
GCSS-MC system. Chapters three and four explain these results in detail.

The second research question is: If an architecture can be developed, what modifications
would the GCSS-MC architecture need? As discussed in detail in chapter three, we used a
sample of the GCSS-MC database and demonstrated cluster interaction with that database
through IP connectivity. This abstraction of the real GCSS-MC system worked very well in
this thesis. We were able to integrate the HDFS solution rather seamlessly into our sample
database and experiment architecture. To fully prove that this approach will work, it
needs to be tested further on an actual implementation of the GCSS-MC system.

The third research question is: How can the data contained in GCSS-MC be imported into
HDFS? In this thesis we used JDBC to connect to the sample GCSS-MC database and parse
the data into JSON format. The data is then imported into HDFS with a bash script. Chapter
four explains in more detail the process that was used to get the GCSS-MC data into the
HDFS ecosystem.

The fourth research question is: What type of analytics can Hadoop provide for GCSS-MC
data? This thesis explores two examples of how to provide analytics on GCSS-MC data.
Those examples are by no means the only analytics that HDFS can provide; they are
merely representative. More specifically, the Alpha program is an example of an analytic
the USMC is asking to be completed on the GCSS-MC data. Chapter four explains the code
of both programs in great detail.

The fifth and final research question is: How will the data get back to the GCSS-MC
database? We chose to write back to the GCSS-MC database in the MapReduce code. We
achieved this by using a JDBC call within the map phase of the MapReduce code. Although
this is not the only way to write data to a database from Hadoop, we felt it was the best
way to achieve the data write-back functionality. Chapter four discusses this code.

5.2 Future Work

This thesis demonstrates the potential of using a Hadoop cluster as a big data element in
the GCSS-MC system. A few things could be examined further to enhance a big data element
of the GCSS-MC system. This section focuses on three additional research areas that could
be extended from this thesis.

The first area of research that can be extended from this thesis is moving the cluster to
a larger environment. We set up the cluster on two machines, and all of the nodes are
virtualized. To better represent what would be seen in a GCSS-MC production
environment, a cluster with more than twenty nodes should be used. Additionally, the
security of the cluster and the interconnectivity between the cluster and the GCSS-MC
system should be examined.

The next area of research could be a comparison between SQL and Hadoop analytics. 
For instance, the Alpha program is derived from an analytic the USMC requested in the 
GCSS-MC system. The Alpha program could run on a cluster and the SQL version could 
be executed on a database and the runtime, compute power, and memory usage could be 
compared between the cluster and database. Furthermore, there are four more analytics 
that can be written in MapReduce to allow for a more complete comparison. 

The final extension of research to the thesis could be the addition of Hadoop big data tools. 
Hadoop offers several tools that help in the processing, analyzing, and importing/exporting 
data. For instance, some effort could be placed on comparing import/export tools against 
one another to discover which tools are best for the GCSS-MC data. Hadoop also offers a 
tool called Hive. Hive allows data to be loaded into Hadoop and SQL-like queries can be 
run on the data. Some effort might be placed on examining performance metrics between 
running a MapReduce analytic versus running the same analytic in Hive. 

APPENDIX A:

Hadoop Testing on Single and Two Node Clusters 


The Hadoop Distributed File System (HDFS) is a cluster computing technology that uses
a parallel computing architecture to provide fast computing and redundant storage. The
HDFS system is an open source project from Apache that is based on Jeffrey Dean and
Sanjay Ghemawat’s MapReduce paper [6]. The intent of HDFS is to bring a reliable,
fault-tolerant parallel architecture to the open source community.

The main idea behind the parallel computing architecture is to allow faster computing
than a single CPU can provide. Even with Moore’s Law producing
computing power that doubles every twelve to eighteen months, there is still a need for
faster computing to deal with large amounts of data. Systems like HDFS hope to answer
the demand for processing large amounts of data quickly over a distributed environment.
HDFS allows processing of large amounts of data using a cluster with thousands of nodes (i.e.,
networked computers).

This is not to say that HDFS is a panacea and the answer for all problems. For instance,
some problems do not lend themselves to the MapReduce paradigm; that is to say, some
problems will not see a speedup when computed in a cluster environment. The problems
that lend themselves well to the MapReduce paradigm are those that can be split into small
chunks and computed without changing the final outcome. The canonical example of this
is the WordCount program. This program takes multiple files as input and computes
the number of occurrences of each word across all of the input files. This can be illustrated by
running the word count on each individual file on a separate node and
then combining the results. This can be achieved significantly quicker than
allowing one computer to process each file individually. However, the additional
step of combining the intermediate files adds latency that would not exist if the work were done by
a single computer. That is considered to be overhead in the MapReduce algorithm. There
are other sources of overhead that need to be considered as well (e.g., network latency, data
locality, etc.).
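
The split-then-combine idea behind WordCount can be sketched on a single machine with ordinary shell tools. This is only an illustration under stated assumptions, not Hadoop code: the function names count_words, merge_counts, and word_count are hypothetical, words are taken to be runs of letters and digits, and case is folded to lowercase. In a cluster, each call to count_words would run on a different node and merge_counts would be the combining step that introduces the overhead described above.

```shell
#!/bin/sh
# "Map" step (hypothetical helper): count the words in ONE file and emit
# "word<TAB>count" lines. On a cluster, each node would do this for its own file.
count_words() {
    tr -cs '[:alnum:]' '\n' < "$1" \
        | tr '[:upper:]' '[:lower:]' \
        | grep -v '^$' \
        | sort | uniq -c \
        | awk '{ print $2 "\t" $1 }'
}

# "Reduce" step (hypothetical helper): read per-file "word<TAB>count" lines
# on stdin and sum the counts for each word.
merge_counts() {
    awk -F '\t' '{ total[$1] += $2 }
                 END { for (w in total) print w "\t" total[w] }' \
        | sort
}

# Driver: count each input file independently, then combine the partial results.
word_count() {
    for f in "$@"; do
        count_words "$f"
    done | merge_counts
}

# Example use (file names are placeholders):
#   word_count file1.txt file2.txt
```

Because each file is counted independently, the per-file work can be done in any order or in parallel without changing the combined totals, which is the property that makes the problem a good fit for MapReduce.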
The purpose of this experiment is to analyze the speedup of an HDFS cluster. This will
be tested by running several benchmarks on two different clusters. The first cluster will be a
single-node cluster and the second will be a two-node cluster. Every effort will be
made to control the environment so that any speedup or slowdown can be attributed
to the HDFS architecture and not to errors in the experiment.

A.1 Hypothesis

The experiment will examine several benchmarks to produce a speedup factor for each
benchmark. The hypothesis is that the two-node cluster will perform better than the
single-node cluster: around 1.75 times better, rather than the ideal two times, because of
the overhead required by HDFS in the two-node cluster to distribute the
work between the nodes.

A.2 Experiment Architecture 

The experiment utilized a virtual environment to set up the two clusters. The settings for
each node were kept identical in order to minimize any variance in the experiment setup.
Each node was built in Oracle VirtualBox version 4.3.2 using a Linux Ubuntu release (see
Table A.1 for additional details). Each node was built and configured separately
in VirtualBox. HDFS1 was set up and configured to be the single-node cluster. HDFS2 and
HDFS3 were set up and configured to be a two-node cluster.


Table A.1: Initial Node Settings

Node Name   HDFS1                   HDFS2                   HDFS3
OS          Ubuntu 13.10 (64-bit)   Ubuntu 13.10 (64-bit)   Ubuntu 13.10 (64-bit)
RAM Size    4096 MB                 4096 MB                 4096 MB
HD Size     100.77 GB               100.77 GB               100.77 GB


The node configuration includes a replication factor, which sets
how many times the data is replicated across the nodes in a cluster. This factor was
overlooked when the initial hypothesis was considered. When a node has to write the same
data more than once, the write takes longer. However, if you reduce that
factor to one on a two-node cluster, then the read takes longer, because the system
experiences overhead when a node has to pass the data to the other node to read. The
results can be found in Section A.4. The experiment architecture (see Figure A.1) was set
up and tested to ensure the basic WordCount program would run. After the initial testing
was run and confirmed to be accurate, the nodes were ready to begin testing.



Figure A.1: Experiment Architecture


A.3 Experiments Run 

The tests run in this experiment were a write test, a read test, and a TeraSort test.
These benchmarks are all available as part of the HDFS release. The tests were configured
to maximize the effectiveness of the experimental architecture. The setup includes single-node
and two-node configurations. When configuring our two-node system, the replication
factor was accidentally set to two. This meant that when writing to HDFS, the cluster
had to write the data twice. We recognized this late in our experiment, but still had time
to perform additional tests with a replication factor of one. Each configuration underwent
twenty Read, Write, and TeraSort jobs.

The Write test generates ten files of 1GB each, thus putting 10GB on the cluster.
Between write jobs, the cluster is cleaned to prevent the drive from recognizing
redundancy and refusing to write for subsequent tests. The Read test reads ten files of
1GB each; we simply do not delete the files from the last run of the Write test. Finally, we
subject each configuration to a TeraSort test. This test first generates 10,000,000 rows
of 100-byte data, for a total of 1GB of data to sort. After each run, the output files
are removed.
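
The data volumes quoted above follow from simple arithmetic, sketched here in shell. The function names terasort_bytes and write_bytes are hypothetical, and the sketch assumes decimal gigabytes (1GB = 1,000,000,000 bytes), which is the convention under which 10,000,000 rows of 100 bytes come to exactly 1GB.

```shell
#!/bin/sh
# Bytes of TeraSort input: number of rows times bytes per row.
terasort_bytes() {
    echo $(( $1 * $2 ))
}

# Bytes physically written by the Write test:
# number of files times bytes per file times replication factor.
write_bytes() {
    echo $(( $1 * $2 * $3 ))
}

GB=1000000000   # decimal gigabyte (an assumption, see the text above)

terasort_bytes 10000000 100   # TeraSort input: 10,000,000 rows of 100 bytes
write_bytes 10 "$GB" 1        # Write test with a replication factor of one
write_bytes 10 "$GB" 2        # Write test with a replication factor of two
```

With a replication factor of two, the same 10GB of logical data requires twice as many bytes to be written to disk, which is why the replication setting matters when comparing write times.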

Each test was run twenty times against each configuration. We collected the data and
summarized it in the charts below.
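
Averaging the twenty runs and turning two averages into a speedup factor can be sketched as follows. The helper names mean_seconds and speedup are hypothetical, and the run times in the usage lines are made-up placeholders rather than measurements from this experiment. Speedup is taken here as single-node time divided by two-node time, so a value above 1 means the two-node cluster was faster.

```shell
#!/bin/sh
# Average a list of run times in seconds, one value per line on stdin.
mean_seconds() {
    awk '{ sum += $1; n++ }
         END { if (n > 0) printf "%.3f\n", sum / n }'
}

# Speedup factor = single-node time / two-node time.
# A result above 1 means the two-node cluster was faster.
speedup() {
    awk -v single="$1" -v two="$2" 'BEGIN { printf "%.2f\n", single / two }'
}

# Placeholder run times (not measured data):
printf '100\n110\n90\n' | mean_seconds
speedup 175 100
```

Under this definition, the hypothesized factor of 1.75 corresponds to the two-node cluster finishing in 100 seconds a job that takes the single node 175 seconds.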


A.4 Results 

As we mentioned previously, write testing on the two-node setup takes twenty percent
longer than on a single-node configuration (see Figure A.2) when using a replication factor
of two, despite the fact that it is using twice the system resources. This can be
attributed to the fact that our test VMs are using the same underlying hard drives. If we
had a true set of servers, this result set would likely look much different. Another reason for
the slowdown is that HDFS has to write the data twice for redundancy. A standard
HDFS cluster uses triple redundancy, but that overhead is shared over dozens of servers.


Write Benchmark: bar chart of average execution time in seconds (axis 0 to 300) for SingleNodeHadoop_RepFactor2, SingleNodeHadoop_RepFactor1, 2NodeHadoop_RepFactor2, and 2NodeHadoop_RepFactor1.

Figure A.2: Write Benchmark Results



When we get to the Read tests (Figure A.3), our results are closer to what we expected. When
reading in a distributed file system (DFS), if the node requesting data does not have it
stored locally, there is a performance impact. Thus our two-node setups lagged twenty percent
and sixty-six percent, respectively, behind the single-node configuration. Interestingly,
the two-node double-replication read ran quicker than the two-node single-replication read. This is
a strength of a DFS: it front-loads the processing time by writing data to multiple
nodes. The result is much faster read times, since the data is in multiple places for nodes
to request it; in our simulation, that meant both nodes had a copy of the data to read.
Read Benchmark: bar chart of average execution time in seconds (axis 0 to 350) for SingleNodeHadoop_RepFactor2, SingleNodeHadoop_RepFactor1, 2NodeHadoop_RepFactor2, and 2NodeHadoop_RepFactor1. Bar labels: 316.435, 98.35, 108.135, 147.305.

Figure A.3: Read Benchmark Results


In our TeraSort tests (Figure A.4), we found that utilizing a second node provided a clear
improvement in average time. Both the single- and dual-replication two-node configurations
completed nearly sixty percent faster than any single-node configuration.


TeraSort Benchmark: bar chart of average execution time in seconds (axis 0 to 200) for SingleNodeHadoop_RepFactor2, SingleNodeHadoop_RepFactor1, 2NodeHadoop_RepFactor2, and 2NodeHadoop_RepFactor1.

Figure A.4: TeraSort Benchmark Results


A.5 Conclusion 

We concluded that even though we were able to see performance improvements with the
TeraSort and Read benchmarks, our underlying architecture prevented us from seeing the
same sort of improvements in the Write tests. It is our belief that the main reason the write
tests do not show a performance increase is the replication factor. Exacerbating
the issue was the fact that the tests were performed in a virtual environment that has to access
the same hard drive (HD) to write. In other words, the write operation cannot be done in
parallel.

In order to limit these side effects, we altered the replication factors to
obtain an equal comparison. This effort led to more data that confirmed some of
our thoughts. At first, we forced the two-node cluster to a replication factor of one. With
the replication factor of one being the same between the single-node and two-node clusters,
the write should perform the same or better on the two-node system due to the bottleneck of the
HD speed. However, what we saw was that the two-node cluster took longer to write the same
amount of data. After studying the HDFS architecture, we came to the conclusion that the
replication factor sets how many times the data is written, but not where the data is written.
So, on the two-node cluster the master node (NameNode) is going to try to balance the
data on both nodes and in doing so is going to experience network overhead, which is what
causes the delay. Additionally, because the nodes are both trying to write at the same time
to the same physical HD, they are going to experience a wait for access to the physical
drive. With the physical HD bottleneck and some network overhead, it is logical to think
that it can account for the additional 18ms it takes to write the data on the two-node cluster.

Moreover, we set both the two-node cluster and the single-node cluster to a replication
factor of two. This seems unrealistic on a single-node system because the single node is
just going to write the same data twice on the same HD. Nevertheless, the tests were run in
an attempt to gain a fair level of comparison. What we found was that the single node performed
better than the two-node cluster on the write test. Again, we can attribute this to the fact
that the two-node system is splitting the work up between the nodes, causing some network
overhead, and when the actual write happens, writing to the same physical HD causes
a slowdown as the process has to wait for access to the HD. With the HD factor and the
network overhead, the slowdown of 41ms seems logical.

After performing all of these tests and studying the HDFS architecture, we believe that
a fair comparison of speedup requires performing the tests on a physical cluster rather
than on virtual instances of clusters. The tests would also be more effective if we could run
them on, say, a three-node cluster versus a six-node cluster. These are still both considered
small clusters, but they should be big enough to use a default replication factor of 3 and show
the performance increase at a fair level of comparison.
APPENDIX B:

How to Set Up a Single Node Hadoop Cluster


This tutorial will guide you through installing a single node Hadoop cluster. This tutorial 
was built using Tom White’s Hadoop: The Definitive Guide [7] and Michael Noll’s Hadoop 
Tutorials [17]. A single node cluster is a great way to start working with the Hadoop 
Distributed File System and the MapReduce paradigm. Once you have installed and tested 
the cluster, all done in this tutorial, you will have your environment set up to start testing 
MapReduce code. 

Notes: 

• This tutorial assumes that you are installing the node on a Debian-based Linux platform.

• This can be done on actual hardware or on a virtual machine. 

• It is recommended that you have at least 4GB of RAM and 100GB of free hard disk 
space. 

1. open up a terminal 

2. Update current packages 

(a) sudo apt-get update 

3. Install Java 

(a) sudo apt-get install openjdk-7-jdk 

4. Verify the install 

(a) java -version 

5. Install nano 

(a) sudo apt-get install nano 

• this is a terminal text editor. You may skip this if you choose to use another
terminal text editor (e.g., vi or emacs)

6. Hadoop uses ssh to talk from the local machine to the namenode. We have to configure
ssh for that.

(a) ssh-keygen -t rsa -P "" 
• save to default file

(b) if you do not have a ssh server install one 

• sudo apt-get install openssh-server 

7. add the key we just created to the authorized key file 

(a) cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

8. test the key 

(a) ssh localhost

• should give you a message that localhost has been added to list of known 
hosts 

9. Hadoop and Ubuntu have a conflict with IPV6. The workaround we will use is to 
disable IPV6 

(a) sudo nano /etc/sysctl.conf 

• add these lines to the end of the file 

- # disable ipv6 

- net.ipv6.conf.all.disable_ipv6 = 1 

- net.ipv6.conf.default.disable_ipv6 = 1 

- net.ipv6.conf.lo.disable_ipv6 = 1 

10. In order for the changes to take effect you have to reboot the machine 

(a) sudo reboot 

11. once the machine has rebooted, open a terminal 

12. check to see if the changes took place 

(a) cat /proc/sys/net/ipv6/conf/all/disable_ipv6 

• you want to see a return of 1 

13. change directory 

(a) cd /usr/local 

14. Download Hadoop 

(a) sudo wget http://download.nextag.com/apache/hadoop/common/hadoop-1.2.1/hadoop-15

15. extract the tarball 

(a) sudo tar xzf hadoop-1.2.1.tar.gz 

16. move the folder and change permissions 

(a) sudo mv hadoop-1.2.1 hadoop 

(b) sudo chown -R <your username>:<your group> hadoop 
17. update the .bashrc file for the hdfsuser

(a) sudo nano /home/<your username>/.bashrc

• add the following lines to the file 

- export HADOOP_HOME=/usr/local/hadoop 

- export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

- unalias fs &> /dev/null 

- alias fs="hadoop fs" 

- unalias hls &> /dev/null

- alias hls="fs -ls"

- lzohead () {

- hadoop fs -cat $1 | lzop -dc | head -1000 | less

- }

- export PATH=$PATH:$HADOOP_HOME/bin

18. create the directory that Hadoop will use to store files

(a) sudo mkdir -p /app/hadoop/tmp 

(b) sudo chown <your username>:<your group> /app/hadoop/tmp 

19. Edit configuration files for use in your environment and point to the directory we
just created

(a) change the hadoop-env.sh file for the java jdk you installed

• sudo nano /usr/local/hadoop/conf/hadoop-env.sh

- uncomment the export JAVA_HOME and give it the correct path

* export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

20. change the core-site.xml file

(a) sudo nano /usr/local/hadoop/conf/core-site.xml 

• add the following in between the configuration tags

- <property> 

- <name>hadoop.tmp.dir</name> 

- <value>/app/hadoop/tmp</value> 

- </property> 

- <property> 

- <name>fs.default.name</name> 

- <value>hdfs://localhost:54310</value> 
- </property>

21. change the mapred-site.xml file 

(a) sudo nano /usr/local/hadoop/conf/mapred-site.xml 

• add the following in between the configuration tags

- <property> 

- <name>mapred.job.tracker</name> 

- <value>localhost:54311</value> 

- </property> 

22. change the hdfs-site.xml

(a) sudo nano /usr/local/hadoop/conf/hdfs-site.xml

• add the following in between the configuration tags

- <property> 

- <name>dfs.replication</name> 

- <value>1</value>

- </property> 

23. Format the HDFS file system and the name node 

(a) /usr/local/hadoop/bin/hadoop namenode -format 

• you should see some output during creation

24. start your cluster

(a) cd /usr/local/hadoop 

(b) bin/start-all.sh 

25. run jps to see that all of the nodes started 

(a) jps 

26. Download book for wordcount

(a) make directory on the local file system

• mkdir /tmp/book

(b) cd to that directory

• cd /tmp/book

(c) download book

• wget http://www.textfiles.com/games/abc.txt 
(d) copy files from local file system to hdfs 

• cd /usr/local/hadoop 
• bin/hadoop dfs -copyFromLocal /tmp/book /user/book

(e) check to see if it is there

• bin/hadoop dfs -ls /user/

• bin/hadoop dfs -ls /user/book

27. run pre-compiled wordcount 

(a) bin/hadoop jar hadoop*examples*.jar wordcount /user/book /user/book-output 

28. make sure the output is there and look at it 

(a) bin/hadoop dfs -ls /user/

(b) bin/hadoop dfs -ls /user/book-output

(c) bin/hadoop dfs -cat /user/book-output/part-r-00000 

29. create a local directory and move files there 

(a) mkdir /tmp/book-output 

(b) bin/hadoop dfs -getmerge /user/book-output /tmp/book-output

30. stop cluster 

(a) bin/stop-all.sh 
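
The jps check in step 25 can be made less error-prone with a small helper, sketched here. The function name missing_daemons is hypothetical. It assumes jps prints one "<pid> <name>" pair per line and that a single-node Hadoop 1.x cluster started with start-all.sh should show the NameNode, SecondaryNameNode, DataNode, JobTracker, and TaskTracker daemons. The helper reads jps-style output on stdin and prints the name of any expected daemon that is absent.

```shell
#!/bin/sh
# Read jps-style output ("<pid> <name>" per line) on stdin and print the
# name of every expected Hadoop 1.x daemon that does not appear in it.
missing_daemons() {
    jps_output=$(cat)
    for daemon in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do
        # Compare the whole second field so that "NameNode" is not
        # satisfied by a "SecondaryNameNode" line.
        printf '%s\n' "$jps_output" \
            | awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' \
            || echo "$daemon"
    done
}

# Intended use while the cluster is running (before step 30):
#   jps | missing_daemons
# No output means every expected daemon was found.
```

If a daemon is reported missing, the log files under the Hadoop installation directory are the usual place to look for the reason it failed to start.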

This completes the tutorial. You are now ready to begin learning how to use the MapReduce
paradigm and writing your own programs. As a place to start, I recommend modifying
the WordCount example on the Hadoop website.
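
For repeated installs, the three configuration edits in steps 20 through 22 could be scripted instead of typed into nano. The following is a minimal sketch, assuming that each file contains only the properties listed in this tutorial. The function name write_site_xml is hypothetical, and the example writes into a temporary directory rather than /usr/local/hadoop/conf so that it can be tried without touching a real installation.

```shell
#!/bin/sh
# Write a Hadoop site file containing the given properties.
# Usage: write_site_xml <output file> <name> <value> [<name> <value> ...]
write_site_xml() {
    out=$1
    shift
    {
        echo '<?xml version="1.0"?>'
        echo '<configuration>'
        while [ "$#" -ge 2 ]; do
            printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
            shift 2
        done
        echo '</configuration>'
    } > "$out"
}

# Scratch directory for the example; a real install would use
# /usr/local/hadoop/conf instead.
conf_dir=$(mktemp -d)

write_site_xml "$conf_dir/core-site.xml" \
    hadoop.tmp.dir /app/hadoop/tmp \
    fs.default.name hdfs://localhost:54310
write_site_xml "$conf_dir/mapred-site.xml" mapred.job.tracker localhost:54311
write_site_xml "$conf_dir/hdfs-site.xml" dfs.replication 1

echo "wrote configuration files to $conf_dir"
```

The property names and values are the ones used in the tutorial steps; values containing characters that are special in XML (such as & or <) would need escaping, which this sketch does not attempt.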